ABSTRACT

The purpose of this dissertation is to provide 
evidence bearing on the question of the influence of the grades 
students receive on their ratings of the college teachers who gave 
them those grades. Specifically, certain characteristics of the grade
distribution within each course section are evaluated as predictors
of the students' ratings of the teacher of that course section.
Multivariate techniques were employed to evaluate data across an
entire university. Over 30,000 anonymous student ratings of 2,360
course sections were collected after students had received final 
course grades, and without student or administrator knowledge that 
the ratings would be used in the study. Factor analysis was used to 
reduce the eight-item rating instrument to a single criterion 
variable. Subsequently, stepwise multiple regression analysis was 
used, both to reduce an initial battery of predictors to an optimally 
reduced subset, and to test the incremental importance of certain 
grading variables as predictors of the criterion. The primary
implication of the study is that there is a relationship between 
grades and ratings, but it only accounts for about nine percent of 
the variance in the ratings. There is a significant bias, but factors 
other than grades must also be influencing student ratings. Whether 
or not these other factors are valid measures of teaching 
effectiveness remains to be determined, but one seemingly invalid 
factor (the grading bias) has been identified. (Author/RC) 
ABSTRACT

FACULTY RATINGS AND STUDENT GRADES:
A LARGE-SCALE MULTIVARIATE ANALYSIS BY COURSE SECTIONS

David Lile Brown, Ph.D.
The University of Connecticut, 1974

Effective teaching has been a primary educational goal 
throughout history, and the evaluation of teaching is 
therefore one of the central concerns of education. Student 
ratings are often considered one of the best indicators of 
teaching effectiveness, because the student is in the most 
privileged position to view the teaching and to experience 
its effects. The use of such ratings has become widespread, 
but controversy rages over whether student ratings are valid
measures of teaching effectiveness, especially when used for
making decisions about faculty pay, promotion, and tenure. 

It is possible that students are not mature enough to 
appraise teaching or, worse, that they may be exploiting 
rating systems to punish strict teachers and to reward 
lenient ones. If so, this would have important implications
for the interpretation of student ratings. Proper
interpretation requires an answer to the research question:
What is the influence of the grades students receive on
their ratings of the college teachers who gave them those
grades? This question has not been answered satisfactorily
by previous research. The earlier evidence has been
conflicting and inadequate in a number of ways. The
present study, however, was intended to avoid many of the 
shortcomings of previous studies. 

This study employed multivariate techniques to evaluate 
data across an entire university. Over 30,000 anonymous 
student ratings of 2,360 course sections were collected 
after students had received final course grades, and without 
student or administrator knowledge that the ratings would
be used in this study. Factor analysis was used to reduce
the eight-item rating instrument to a single criterion
variable. Subsequently, stepwise multiple regression
analyses were used, both to reduce an initial battery of
predictors to an optimally reduced subset, and to test the
incremental importance of certain grading variables as 
predictors of the criterion. 

Results showed that the simple correlation between the 
average student grade in each course section and the average 
student rating of the teacher of that course section was 
.35, p < .000001. Moreover, the average grade was the
single best predictor, of those available, of the average
rating, and when average grade was added to the optimally
reduced subset of other predictors, it significantly
improved the multiple correlation from .25 to .39,
F (4, 2343) = 60.13, p < .001.

The primary implication of the results of this study 
is that there is a relationship between grades and ratings,
but it only accounts for about 9% of the variance in
ratings. In other words, there is a significant bias, but
factors other than grades must also be influencing student
ratings. Whether or not these other factors are valid
measures of teaching effectiveness remains to be determined,
but one seemingly invalid factor (the grading bias) has 
been identified through this study. The interpretation of 
student ratings should take this bias into account, and 
methods should be devised to eliminate it. 
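
The figures reported in this abstract can be checked roughly by arithmetic. The following is a minimal sketch, assuming the conventional F test for an increment in the squared multiple correlation; the function name is illustrative and does not come from the study. Because it starts from the rounded correlations (.25 and .39), it lands near, but not exactly on, the reported F of 60.13, and the gain in explained variance comes out at about 9%.

```python
def incremental_f(r_full, r_reduced, n_added, df_resid):
    """F statistic for the gain in R-squared when n_added predictors join a model."""
    gain = r_full ** 2 - r_reduced ** 2
    return (gain / n_added) / ((1 - r_full ** 2) / df_resid)

# Rounded values taken from the abstract: R rises from .25 to .39, F(4, 2343)
gain = 0.39 ** 2 - 0.25 ** 2          # share of rating variance added by grades
f_stat = incremental_f(0.39, 0.25, n_added=4, df_resid=2343)

print(f"variance added: {gain:.4f}")   # about nine percent
print(f"approximate F:  {f_stat:.1f}") # close to the reported value
```

The small gap between this approximation and the reported statistic is what one would expect from correlations rounded to two decimals.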
CHAPTER I

STATEMENT AND DEFINITION OF THE PROBLEM 
Statement of the Problem 

Considerable recent controversy has centered around
faculty evaluation methods, especially student ratings of
college teachers (Centra, 1973; Doyle & Whitely, 1974; Frey,
1974; Zelby, 1974). According to several researchers
(Bausell & Magoon, 1972a; Capozza, 1973; Carrier, Howard, &
Miller, 1974; Centra, 1973; Costin, Greenough, & Menges,
1971; French-Lazovik, 1974; Menzie, 1973; Treffinger &
Feldhusen, 1970), the use of student ratings to evaluate the
faculty has become so widespread that it is now almost taken
for granted at institutions of higher learning. Controversy
rages, however, over the validity of such ratings, especially
concerning the trend toward "formal, quantitative use of the
results of the evaluations in determinations of faculty
promotions and salaries" (Zelby, 1974, p. 1267).

Faculty reactions to student evaluation of teaching 
range from acclaim to outrage. Supporters point to the need 
for evaluation and to the potential that ratings may have 
for improving education. Opponents emphasize that students 
may not be proper judges, or that rating instruments may not
ask students the right questions.

The whole issue is complex, especially this
question of validity. Are student ratings a valid measure
of teacher effectiveness? The answer depends, of course, on
definitions of validity and of teacher effectiveness. Since
there are no universally accepted definitions of these,
progress toward the development of such measures is impeded.
Nevertheless, attempts at progress commonly are made with a
rather vague notion of what "effective teaching" means. 

If factual learning alone were the objective of 
effective teaching, one could validly measure various 
teachers' effectiveness through achievement testing. Indeed, 
many would support such measures as extremely valid, but 
others would insist that the "real goal in teaching is to
impart philosophical values or to inculcate a special
attitude toward learning rather than to simply help the
student to master the subject matter" (Frey, 1974, p. 47).
Still others would be inclined to consider the promotion of
factual learning primary, but in combination with certain
affective factors. How effective, for example, is the
teacher in drawing students' attention and affection? Or
how are the students made to feel about the specific subject
matter they are learning? Does the teacher make it
interesting?

Whatever view one takes of effective teaching, the 
major issue relevant to student ratings is still validity: 
Are student ratings a valid measure of effective teaching? 
Some would argue that student ratings are valid because the
student, as the consumer, is in the best position to view
the teaching and to experience the effects of the teaching
(see Centra, 1974; Costin, Greenough, & Menges, 1971; Frey,
1974; McKeachie, 1969).

Others would argue, for various reasons, that student 
ratings are not valid as a measure of teaching effectiveness. 
For example, LeComte (1974), Peck (1971), and St. Onge (1974)
express strong doubts that students are in any position to
judge the scholarship offered by the faculty. According to
Costin et al. (1971), student ratings are typically
challenged on the grounds "that student ratings are
unreliable, that the ratings will favor an entertainer over
the instructor who gets his material across effectively,
that ratings are highly correlated with expected grades (a
hard grader would thus get poor ratings) and that students
are not competent judges of instruction since long-term
benefits of a course may not be clear at the time it is
rated" (p. 511).

One of the most important points of contention is 
whether or not students can be objective enough for their 
ratings to be valid. Even though student ratings are 
necessarily subjective opinions, they could still be valid 
measures of effective teaching if students were able to 
identify effective teaching (or at least some component or 
components thereof), and if their ratings were unbiased by
irrelevant variables. Some believe, however, that students
may bias their ratings in favor of lenient teachers, or at
least in favor of the teachers in whose courses they get the
highest grades. If so, this would have important
implications for the interpretation of such ratings. It is
even likely that such ratings would be at odds with larger
educational objectives. For instance, "given a specific
format, it is possible to adapt one's teaching technique to 
obtain a good or bad evaluation and ... a good evaluation 
may be associated with a teaching technique of lesser 
educational value than a poor evaluation" (Zelby, 1974, p. 
1268). 

Therefore, one may pose the research question: What is 
the influence of the grades students receive on their
ratings of the college teachers who gave them those grades?

Purpose of the Study 

The purpose of this study is to provide evidence 
bearing on the research question. Specifically, certain 
characteristics of the grade distribution within each course 
section are evaluated as predictors of the students' ratings 
of the teacher of that course section. The following are 
the grading variables:

1. Average of the grades in the course section, 

2. Standard deviation of the grades, 

3. Variance of the grades, 

4. Skewness of the grades, and 

5. Kurtosis of the grades. 
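
The five grading variables listed above are ordinary descriptive statistics of a section's grades. The following is a minimal sketch of how they could be computed for one course section, assuming simple divide-by-n moment estimators (the estimators actually used in the study are not stated here); the function name and the ten sample grades are hypothetical.

```python
import math

def grade_moments(grades):
    """Return mean, SD, variance, skewness, and kurtosis of a list of grades."""
    n = len(grades)
    mean = sum(grades) / n
    # Central moments with a divide-by-n convention (an assumption, see lead-in).
    m2 = sum((g - mean) ** 2 for g in grades) / n
    m3 = sum((g - mean) ** 3 for g in grades) / n
    m4 = sum((g - mean) ** 4 for g in grades) / n
    sd = math.sqrt(m2)
    skewness = m3 / sd ** 3 if sd > 0 else 0.0
    kurtosis = m4 / m2 ** 2 if sd > 0 else 0.0  # equals 3 for a normal distribution
    return {"mean": mean, "sd": sd, "variance": m2,
            "skewness": skewness, "kurtosis": kurtosis}

# Hypothetical section of ten students, grades on the 0-to-4 scale
section = [4, 4, 3, 3, 3, 3, 2, 2, 1, 0]
stats = grade_moments(section)
for name, value in stats.items():
    print(f"{name:9s} {value:.3f}")
```

A negative skewness, as in this invented section, indicates a distribution with most grades high and a tail of low grades.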

Stepwise multiple regression analysis is employed to
evaluate these grading variables for their effectiveness as
predictors, when used in combination with certain other 
predictors of such faculty ratings. These other predictors 
constitute an "optimally reduced subset" of the following 
variables: 

1. Sex, 

2. Appointment length, 

3. Percentage employed, 

4. Class size, 

5. Number of years tenured,

6. Age, 

7. Years since hiring, 

8. Course level, 

9. Title level, 

10. Course location, 

11. Department quantitativeness, and

12. Rate (percentage) of return of ratings. 

An "optimally reduced subset" is defined as the subset which
had the lowest standard error of estimate during a prior
stepwise multiple regression analysis.
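
The definition above can be illustrated in code. The following is a minimal sketch, assuming a simplified forward-only selection in which the predictor that most lowers the standard error of estimate enters at each step; it is not the stepwise program used in the study, which may have applied different entry and removal rules. The function names and the synthetic data are hypothetical.

```python
import numpy as np

def standard_error_of_estimate(X, y):
    """Residual standard error of an OLS fit of y on X plus an intercept."""
    n = len(y)
    design = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    return float(np.sqrt(resid @ resid / (n - design.shape[1])))

def forward_stepwise(X, y):
    """Enter predictors greedily; return the subset with the lowest standard error."""
    remaining = list(range(X.shape[1]))
    chosen = []
    # Baseline: the intercept-only model, whose standard error is the SD of y.
    best_subset, best_see = [], float(np.std(y, ddof=1))
    while remaining:
        # Enter the candidate giving the smallest standard error at this step.
        see, col = min((standard_error_of_estimate(X[:, chosen + [c]], y), c)
                       for c in remaining)
        chosen.append(col)
        remaining.remove(col)
        if see < best_see:
            best_subset, best_see = list(chosen), see
    return best_subset, best_see

# Synthetic data: only columns 0 and 3 truly predict the criterion.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 2.0 * X[:, 0] - 1.0 * X[:, 3] + rng.normal(scale=0.5, size=200)
subset, see = forward_stepwise(X, y)
print(subset, round(see, 3))
```

Because the standard error of estimate penalizes each added predictor through the residual degrees of freedom, it can rise when a useless predictor enters, which is what makes a "lowest standard error" stopping point meaningful.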

The hypothesis of this study is that the grading
variables are strongly related to student ratings of college
teachers, and that they will significantly improve the best
prediction of those ratings attained by the other available
predictors.

Need for the Study

The evaluation of teaching is one of the central 
concerns of education. Effective teaching has been a 
primary educational goal throughout history. As mentioned 
above, one commonly used measure of effective teaching is
evaluation by the student. It would be hoped that student
ratings would be valid and that they would promote effective
teaching through feedback to teachers and administrators.
However, it is not certain that students are mature enough,
wise enough, or objective enough to evaluate such
instructional behavior without bias. Are student ratings
valid measures of effective teaching? Or are they more or 
less a "payoff" in a sort of game between teachers and 
students, in which teachers reward (or punish) students with 
grades, and students respond in kind with ratings? 

The conflicting evidence to date has not satisfactorily 
answered this question. The extent and the importance of
the relationship between grades and ratings have not been
determined. Neither have any causes of such a relationship
been positively identified, partially because the
relationship itself has not been firmly established.

Former evidence on this question has been not only
conflicting, but also inadequate in a number of ways. The
results of previous studies will be summarized in chapter
II, but most of these studies had one or more of the
following potential shortcomings:

1. A relatively small sample size;

2. Sampling of only one or a few departments;

3. Sampling of only graduate assistant level teachers;

4. Sampling of team-taught courses;

5. Sampling of only freshman-level courses (or some
other level);

6. Contamination by voluntarism (using only teachers
who volunteer to be rated);

7. Contamination by sensitization (a new system is
initiated, or a class is upset by a researcher suddenly
appearing on the scene);

8. Contamination by knowledge of the study (students,
faculty, or administrators know ahead of time that the
ratings will be used in an "experiment");

9. Contamination by lack of anonymity protection for
the students;

10. An indefinite treatment (students did not actually
know what their final grade in the course would be at the
time of the ratings); and

11. A lack of multivariate analysis (often only simple
correlations among a few variables are reported, whereas
multivariate techniques permit simultaneous examination of
the influence of many variables and provide evidence of the
"extent as well as the nature of any such influence" (Doyle &
Whitely, 1974, p. 260)).

This present study does not suffer from any of the 
above potential shortcomings, and it should provide better 
evidence bearing on the research question than that
previously found.

A further need for this study is institutional. While 
a somewhat similar study was conducted at the University of 
Connecticut ten years ago, no such study has been done 
recently. Changes have occurred in the rating instrument 
since that study of 10 years ago (Garber, 1964), and 
improvements have been made in the administrative procedures 
connected with the collection of ratings. There is reason
to believe the data are less subject to collection errors
now than formerly.

Furthermore, and perhaps more importantly, over the last
10 years, grades at the University of Connecticut have taken
a sharp turn upward (see Figures 1-3 below). Whatever the
relationship between grades and ratings may have been 10
years ago, that relationship well may have changed in light
of the changes in grading, or in light of certain other
changes related to the rise of student power in general.
The advent, in the fall of 1968, of both pass-fail options
and liberalized policies regarding the dropping of courses
may have influenced the relationship between grades and
ratings in some way. The appointment of students to
formerly all-faculty committees is another recent change
which could possibly bear upon the research question.

Whether or not the trend in grading has had any effect
on the relationship between grades and ratings, it is a
striking trend and worthy of attention. Figure 1 shows the
trend over the last 25 years of the quality point ratios
(QPR's) of graduating seniors. The QPR is the grade average
computed as the sum of the quality points (number of credits
in a course times the grade in that course, where "A" = 4,
"B" = 3, "C" = 2, "D" = 1, and "F" = 0) divided by the total
number of credits. At the University of Connecticut, QPR's
are multiplied by 10 and range from 0 to 40, but for the
sake of clarity to readers outside of the University of
Connecticut, the range used throughout this study is the
more prevalent 0 to 4. Figure 1 demonstrates that there was
a great deal of consistency from 1950 until 1967 or 1968,
but that the trend has been upward ever since. Table 1
provides the data used in Figure 1.
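
The QPR computation described above can be sketched as follows. This is an illustration only: the function name and the four-course transcript are hypothetical, and the result is expressed on the 0-to-4 scale used throughout this study (the University's own scale would multiply it by 10).

```python
# Grade-to-point key as defined in the text
GRADE_POINTS = {"A": 4, "B": 3, "C": 2, "D": 1, "F": 0}

def qpr(courses):
    """courses: list of (credits, letter_grade) pairs; returns QPR on the 0-to-4 scale."""
    total_credits = sum(credits for credits, _ in courses)
    quality_points = sum(credits * GRADE_POINTS[grade] for credits, grade in courses)
    return quality_points / total_credits

# Hypothetical transcript: 30 quality points over 11 credits
transcript = [(3, "A"), (4, "B"), (3, "C"), (1, "F")]
value = qpr(transcript)
print(round(value, 2))        # 0-to-4 scale
print(round(value * 10, 1))   # 0-to-40 scale used internally at the University
```

Weighting by credits means that a low grade in a one-credit course lowers the QPR far less than the same grade in a four-credit course.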

Figure 2 supplies further evidence of the trend in
grading at the University of Connecticut, showing the median
QPR's of all undergraduates after each semester. Note that
each spring semester is represented by the year marks along
the X-axis of Figure 2, while each fall semester is at the 
half-way point between year marks. The upward trend in 
median QPR's indicated by Figure 2 started about 1964, which 
is, as one would expect, three or four years before the 
start of the upward trend in graduating seniors' QPR's as 
shown in Figure 1. The data for Figure 2 are given in Table 
2. Data for the spring semester of 1970 are missing because 
of a student strike near the end of that semester, following 
the U.S. bombing of Cambodia. In many courses that semester 
final exams were cancelled and grades of "S" for

Figure 1. Quintile QPR cutoff points for graduating seniors, 1950-1974.

Figure 2. Median QPR's of all undergraduates after the end of each semester, 1952-1974.

Table 1

Data for Figure 1, QPR Cutoff Points for
Graduating Seniors, 1950-1974

Graduating               Percentiles
Class of        80th    60th    40th    20th

1950            2.80    2.49    2.26    2.04
1951            2.83    2.49    2.24    2.02
1952            2.80    2.46    2.21    1.99
1953            2.85    2.52    2.28    2.06
1954            2.81    2.50    2.23    2.03
1955            2.81    2.44    2.20    2.00
1956            2.84    2.51    2.24    2.00
1957            2.84    2.49    2.23    2.00
1958            2.82    2.48    2.23    1.93
1959            2.83    2.48    2.21    2.00
1960            2.82    2.48    2.22    2.08
1961            2.81    2.48    2.24    2.03
1962            2.85    2.?7    2.24    2.04
1963            2.83    2.49    2.25    2.04
1964            2.81    2.43    2.25    2.05
1965            2.83    2.50    2.26    2.04
1966            2.83    2.50    2.27    2.06
1967            2.88    2.55    2.30    2.03
1968            2.89    2.54    2.30    2.06
1969            2.94    2.59    2.36    2.10
1970            3.04    2.70    2.44    2.17
1971            3.11    2.79    2.52    2.25
1972            3.19    2.89    2.62    2.33
1973            3.30    2.99    2.72    2.43
1974            3.32    3.02    2.77    2.47

Note. The minimum QPR for graduation (0th percentile) was
1.80 every year.

Table 2

Data for Figure 2, Median QPR's for
All Undergraduates, 1952-1974

Year    Spring Semester    Fall Semester

1952         2.27              2.21
1953         2.28              2.19
1954         2.27              2.21
1955         2.29              2.24
1956         2.31              2.24
1957         2.30              2.24
1958         2.28              2.20
1959         2.23              2.18
1960         2.31              2.23
1961         2.29              2.27
1962         2.19              2.25
1963         2.18              2.28
1964         2.35              2.26
1965         2.37              2.35
1966         2.45              2.43
1967         2.48              2.45
1968         2.56              2.53
1969         2.57              2.58
1970          --               2.72
1971         2.87              2.82
1972         3.02              3.04
1973         3.03              2.88
1974         3.02               --

Note. Data for the spring semester of 1970 and the fall
semester of 1974 (current semester) are missing.

Table 3

Data for Figure 3, Average of
All A-Through-F Grades, 1961-1974

Year    Spring Semester    Fall Semester

1961          --               2.23
1962         2.29              2.26
1963         2.31              2.27
1964          --               2.26
1965         2.34              2.33
1966         2.42              2.40
1967         2.44              2.48
1968         2.56              2.55
1969         2.64              2.59
1970         3.18              2.73
1971         2.82              2.79
1972         2.89              2.86
1973         2.89              2.80
1974         2.92               --

Note. Data for the following semesters are missing: spring
of 1961, spring of 1964, and fall of 1974 (current
semester).

"Satisfactory" were given. Some courses did have finals,
and some teachers used regular grades with or without final
exam scores. However, the median QPR for all undergraduates 
was not computed by the administration because of the 
drastic effect of the strike on grades. That effect is 
clearly indicated in Figure 3. 

Figure 3 shows the trend in the average of the
"A-through-F" grades given each semester. Again, the upward
trend since 1964 is clearly indicated. The inclusion,
starting with the spring semester of 1967, of 300's
(graduate) level courses would account for some of the rise
in this curve, but it should have resulted in a one-time
jump. The constant upward trend cannot be explained by the
inclusion of these courses, which was unavoidable because of
a change in administrative procedure. The data for Figure 3
are shown in Table 3.

There is no indication that the ability level of the 
students at the University of Connecticut has followed (or
preceded) such a drastic trend as that indicated by Figures
1-3 (personal communications with the Admissions Office). A
study by Baird and Feister (1972) found that a large number
of faculties tended to give out the same distribution of
grades over the years 1964-1968, even regardless of changes
in the ability level of the students in some cases. The
grading trend found at the University of Connecticut,
however, is quite the opposite (higher grades without any
such increase in student ability levels). This trend
seems to be part of a nationwide trend toward leniency
occurring since the period of the Baird and Feister study.
Recent reports involving some 23 campuses across the nation
("Cal State," 1974; Ladas, 1974; "Too Many A's," 1974) have
supplied evidence that a "grade glut has been spreading
across academe" ("Too Many A's," 1974, p. 106) in the last
few years. 

It is not certain whether teachers are "simply being 
generous" or "are bribing students with good grades to get 
good grades themselves" ("Too Many A's," 1974, p. 106). But 
there is definitely an increased leniency on the part of
this faculty, and it is possible that the relationship
between grades and ratings is not the same as it was 10
years ago. Furthermore, it is likely that the upward trend
in grading has not been universal across departments (cf.
Ladas, 1974; Postman, 1974) nor across all teachers and that,
as a result, the influence of grades, if any, on ratings 
would be more pronounced for some teachers and for some 
departments than for others. 

Thus there are multiple reasons for conducting this 
study. The research question has not been satisfactorily 
answered; previous studies have suffered from several
shortcomings; and changes in the ratings instrument and
grading practices have occurred at this university (and
apparently nationally). The present research is an attempt 
to provide better and more up-to-date knowledge concerning 
faculty evaluation by students. 




Sources of the Data 

This study uses data from the spring semester of 1973
at the University of Connecticut. Some of the data were
obtained from admissions records or other branches of the
university administration and were collected prior to 1973.
The evaluation data were collected by the Bureau of 
Institutional Research in the customary manner established 
by that office and followed each spring. Detailed 
descriptions of the variables investigated and the rating 
instrument used appear in chapter III. 

No one, including the students and the administrators 
who gathered the data, knew that the data would be used in
this present study. Grade data were obtained from the
registrar's records. Since ratings are done anonymously,
it is not possible to match the individual students' ratings
with their grades. Nevertheless, a great deal of information
about the distribution of grades in each course section is
known and can be matched with a large amount of other data
about the course section and its teacher. Thus, the 
separate course sections (N = 2,360) were the units of 
analysis for this study. The loss of the ability to match 
grades and ratings for individual students is offset by 
certain compensations, such as the anonymity of the ratings 
data (which is what precludes the matching) and the ability 
to study data across an entire university. 
CHAPTER II
SUMMARY OF RELATED RESEARCH

This chapter presents a summary of the theories and 
research which form the background to this study. The 
research related to the major issue of the validity of
student ratings is covered first, followed by a summary of 
the research on several related issues. 

Validity of Student Ratings

In 1971, Costin, Greenough, and Menges conducted a
review of the literature on student ratings of college 
teachers. They stated that faculty members could judge the 
validity of student ratings "to the extent students' 
subjective criteria match the faculty members' goals in 
teaching" (p. 513), and thus, a determination must be made 
of "the basis on which students make their judgments" (p. 
514). In their discussion of various possible bases for 
student judgments, Costin et al. cited conflicting evidence 
as to whether "students may judge instruction on the basis 
of its 'entertainment' value rather than on its information 
contribution to learning, or long-term usefulness" (p. 517).
They also stated that students may make rating judgments
based on the grades received or expected in the course.
This is the issue upon which this present study is focused. 

Costin et al. reviewed 28 studies of the relationship
between student ratings and grades. Of those, 15 found no
significant relationship, 1 found a small negative
correlation, and 12 found significant, but small, positive
correlations. Typically the correlations did not exceed
.30, but they could be considered collectively as evidence
that the true relationship may be low and positive.

In an earlier study at the University of Connecticut, 
Garber (1964) used student ratings (responses to the eight 
separate items on the then-current rating instrument) to
predict "difference scores" (the differences between the
students' grades and their "expected grades"). Current
grade point averages were used as the expected grades.
Garber obtained a multiple correlation of .43; correction
for shrinkage yielded an R of .39 (a highly significant
multiple R, p < .001). This outcome was for the multiple
regression analysis using teachers as units of analysis.
Using students as units of analysis, Garber obtained
similar results (R = .31; correction for shrinkage yielded
an R of .28, p < .001). He concluded that the two behaviors,
student ratings of their teachers and the direction in
which the teachers tended to grade the students (higher or
lower than their current grade point averages), were either
directly or indirectly dependent on each other.
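
A correction for shrinkage adjusts a sample multiple R downward to allow for the number of predictors relative to the number of cases. The following is a minimal sketch, assuming the common adjusted R-squared form of the correction; the formula Garber actually applied and his sample sizes are not given here, so the function name and the value of n below are purely hypothetical.

```python
import math

def shrunken_r(r, n, k):
    """Adjusted multiple R for n cases and k predictors (adjusted R-squared form)."""
    adj_r2 = 1 - (1 - r ** 2) * (n - 1) / (n - k - 1)
    return math.sqrt(max(adj_r2, 0.0))  # clamp at zero if the adjustment overshoots

# Hypothetical illustration: an observed R of .43 from 8 predictors and 200 cases
example = shrunken_r(0.43, n=200, k=8)
print(round(example, 3))
```

The adjustment grows as the ratio of predictors to cases grows, and it vanishes as n becomes very large.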

Bausell and Magoon (1972a), using a much larger sample
than most other studies (N = 12,000 ratings, from which
random stratified samples were drawn), "found strong,
consistent biases in both instructor and course ratings 
which can be traced to (a) the grade the student expects to 
receive, and (b) the discrepancy between the student's 
expected grade and his GPA" (p. 1021). They concluded that 
student ratings of instructors are "axiomatically valid for 
their designed purpose, but must be interpreted with 
caution" (p. 1022) since the assignment of low grades may 
be proper in many cases, but would result in lower ratings. 
The implication is that administrative use of ratings for 
pay, promotion, and tenure decisions may require great
caution. 

A study by Treffinger and Feldhusen (1970) used 
characteristics of the students to predict student ratings 
of their teachers. They concluded that student variables, 
including grades received, were not significant predictors.
They found that generalized pre-course ratings were most
often the best predictor of end-of-course ratings, and
therefore they concluded that the "student's rating of the
course is clearly a complex interaction of his initial
feelings, certain cognitive and affective characteristics
of teachers and pupils, and instructor performance" (p. 622).
Treffinger and Feldhusen noted that "it should be the
instructor's performance or ability as a teacher and
students' reliable perceptions and evaluations of that
performance which constitute the majority of the variance
in instructor ratings" (p. 622). They suggested that the
difference between pre- and post-course ratings might be 
more useful than post-course ratings. But other research 
(especially Holmes, 1972) indicates that such a difference 
may be very heavily affected by changes in grades from 
initial expected grades to end-of-course expected or actual
grades.

Bausell and Magoon (1972b) found that "rating changes
did occur as a function of changes in grade expectancy"
(p. 10). Holmes (1972) conducted an experiment in which
expected grades were "disconfirmed." "Half of the students
who deserved and expected A's or B's were given their
expected grades, while half were given a grade one step
lower than expected" (p. 130). Correct grades were given
after ratings were collected. He concluded that "if
students' grades disconfirm their expectancies, the students
will tend to deprecate the instructor's teaching performance
in areas other than his grading system" (p. 130, emphasis
added). These results support a theory that students are
not objective; that they use ratings as a "payoff"; and that
ratings might therefore be invalid. 

Many studies have investigated the relationship between 
ratings and achievement rather than grades. Such studies
are, of course, directly related to the validity issue since
it is obvious that high achievement constitutes a primary
goal of effective teaching. If students rate highest the
teachers from whom they learn the most, then student ratings
would have a certain validity. Furthermore, such studies
often provide important insights into the relationship
between ratings and grades insofar as achievement and grades
are closely interrelated.

One such study (Rodin & Rodin, 1972) stirred a great
deal of interest in the relationship between ratings and
grades. The authors found a significant negative
correlation between amount learned and ratings of the
teachers (teaching assistants in one large undergraduate
calculus course with 12 sections). Controversy was stirred
over their conclusion that "perhaps students resent
instructors who force them to work too hard and to learn
more than they wish" (p. 1166).

Another investigator (Capozza, 1973) agreed. He
concluded that "the hypothesis that students give good
ratings to classes in which they learn a great deal must be
rejected. Emotional factors such as grades and perhaps a
distaste for the hardships of learning seem to bias the
results in the opposite direction" (p. 127). Though his
sample (250 students in eight course sections) was larger 
than many, Capozza himself considered it small. 

Gessner (1973) found results opposite those of the 
Rodins and Capozza. He criticized the Rodin and Rodin study 
for its use of teaching assistants, whose instructional role
was apparently an ancillary one. Gessner's own data,
however, were based on only one course (N = 78), and it was
team-taught. There can be no generalization, then, across
courses or instructors.

A study by Potter, Nalin, and Lewandowski (1973)
employed a better research design than others (with students
and teachers randomly assigned to classes). Unfortunately,
however, teacher trainees were used. They found a modest,
but significant, correlation between ratings and achievement
when the effect of aptitude was partialled out. Although
their sample was small (N = 254 students in 24 course
sections), they observed that the magnitude of the
correlation between achievement and ratings seemed to be
lower when the correlation between achievement and aptitude
was higher, and vice versa. They concluded: "the stronger
the relation between aptitude and achievement, the less the
relation between achievement and rating" (p. 2). A
corollary of this conclusion might be that ratings may
suffer from a grading bias in direct proportion to how
arbitrary the grading is (cf. Holmes, 1972).

There are two conflicting explanations often proposed
for a negative correlation between achievement and ratings:
(a) better learners are more critical of teachers or (b)
there is resentment of the hard work teachers force students
to do in order to achieve. Rodin (1974), however, indicated
that we may not need either of these theoretical
explanations, since the true relationship may well be
positive. She reviewed several studies and found that most
of the correlations between ratings and amount learned
tended "to lie in the range r = .20 to r = .30 . . . which
indicates that amount learned accounts for only about 4 to 9
percent of the variance in student ratings" (p. 5). Rodin
concluded that student ratings are not indicative of amount
learned. She compared student ratings of teaching to
consumer ratings of the palatability of peanut butter, and
argued that, just because the consumer ratings may be poorly
related to laboratory ratings of nutritiveness, the validity
of palatability ratings is not thereby impugned (unless one
expects the palatability ratings to tell us all about peanut
butter). Rodin implied, therefore, that student ratings may
be a valid measure of one component of effective teaching,
and that grades could be one important influence on that
component.

Related Issues 

Several issues, which have been hinted at above, are
directly related to validity. One is the possible presence
of a "halo-effect." That is, are global impressions at
work, masking the separate traits of effective teaching?
Do the individual items on rating instruments provide useful
information, or do they form large clusters of traits? Such
clusters of traits could be valid measures of teaching
effectiveness even if the separate items or traits were not.

Strong halo effects were noted by Royce (1960), Garber
(1964), Potter, Nalin, and Lewandowski (1973), and Widlak,
McDaniel, and Feldhusen (1973). Hoyt (1969) found a halo
effect in student ratings of courses and concluded that it
interfered with students' ability to discriminate fairly
among the traits of the course covered by the rating
instrument. Costin, Greenough, and Menges (1971), however,
cited several studies in support of a theory that ratings 
on multiple attributes of instruction are low in halo 
effect. 

Widlak et al. reviewed a representative sample of 22
studies over the past 30 years. They concluded that, while
most rating instruments can be factored into at least two
components, there is likely to be a halo effect "so strong
. . . that the specific item ratings may have little
diagnostic value in assessing a teacher's strengths and
weaknesses and, ultimately, low potential for improving
teaching" (p. 10). Widlak et al. found three factors
throughout the studies they reviewed. They called these the
"Actor," "Interactor," and "Director" factors (referring to
roles of teaching). These might also be called
"Performance," "Rapport with students," and "Course 
structure/difficulty" factors. 

Another issue, somewhat related to the halo-effect
issue, is the persistence of first impressions or
preconceptions. The findings of Bausell and Magoon (1972b) 
suggest that first impressions and preconceptions are 
persistent. They found that ratings of teachers after the 
first day of classes were very durable, and that they were 
highly correlated (r = .67, n = 20 courses) with ratings 
taken at the end of the course.

As mentioned above, Treffinger and Feldhusen (1970)
found that generalized pre-course ratings were most often
the best predictor of end-of-course ratings. Furthermore,
Parent (1971) suggests that ratings should be taken early
in the course to provide feedback to the teachers in time
to modify the teaching performance before the courses are
finished and "wasted." If early attitudes are persistent,
then end-of-course ratings would not improve, but Parent
suggests that end-of-course ratings should be eliminated.
If ratings were always done very early in the course, some 
of the grading bias (if there were one) might be eliminated. 
Some bias might remain, however, since expected grades 
could still have an influence, as could grades on various 
quizzes and papers. 

On the one hand, the durability of first impressions
might indicate that student ratings are invalidated, 
because they are not measures of teaching effectiveness 
over the entire course. On the other hand, it may be that, 
whatever student ratings are measuring, it simply does not 
take very long to evaluate it (especially, perhaps, the 
Actor factor). In the case of preconceptions, it could be 
that there is accurate foreknowledge about the course or 
instructor from word-of-mouth advice from other students,
or from prior experiences with the instructor himself or
with similar courses taught by others.

Costin et al. suggested that student ratings might be
shown to be valid if there were a high correlation between
student and peer ratings (agreement between different
judges). The results of the Bausell and Magoon (1972b)
study suggest that ratings by outside observers, including
other faculty members, would tend to correlate highly with 
student ratings, even though the outside observers might 
view only a few class sessions. Touq and Feldhusen (1973) 
supported this view, as do the results of four studies 
(r's from .30 to .63) cited by Costin et al. (1971). Centra 
(1974), however, found that while there is agreement between 
student and peer ratings of teachers, the peer ratings were 
"less reliable" (low interjudge agreement among the peers). 
He concluded that both student and peer ratings should be 
used, but to measure different traits. Costin et al. 
cautioned that peer ratings may be influenced by student 
ratings through hearsay (see also Jaeger & Freijo, 1974). 
It is possible, however, that a grading bias is responsible 
for the difference between student and peer ratings, or at 
least some portion of the difference. One implication is 
that peer ratings might be used to partial out the grading 
bias if one exists. 

Still another issue relates to the definition and 
measurement of effective teaching. It is the issue over 
whether or not the good researcher is likely to be a good 
teacher. Stallings and Singhal (1969) found a significant
correlation between teacher evaluations and research output
(measured by weighted combinations of the number of books,
articles, etc.). They concluded, as many others have, that
productive researchers are usually very good teachers, but
that students often have the stereotype that very active
researchers neglect their teaching. If pervasive, such a
stereotype would tend to lower the validity of student
ratings of researchers, since the ratings might be based on
a misconception. McDaniel and Feldhusen (1970) found that
students did bias their ratings against heavy researchers
or prolific writers; they theorize that the students may
be correct, that researchers do neglect their teaching.

Many studies (notably Aleamoni, 1974; Bendig, 1952; 
and Elmore and LaPointe, 1974) have examined relationships
between faculty ratings and certain characteristics of the 
teachers or courses. There was, according to them, little 
consensus about the characteristics of the "best" or "worst" 
teachers or classes. At the University of Connecticut, 
however, administrative findings over the past few years 
indicate there are definite differences in ratings across 
several variables. Historically, students at this 
university rate male teachers slightly higher than female 
teachers overall, but not on every item of the scale. In 
general, the smaller the class size, the higher the ratings 
of the instructor have been. Similarly, classes at the 
branches have been rated higher than courses at the main 
campus at Storrs, Connecticut. Officials have noted, 
however, that this last difference may be due to the fact
that the largest course sections are taught at Storrs. 

Historical evidence also indicates that at the
University of Connecticut, the more advanced the course,
the higher the ratings. Tenured faculty have received
better ratings than non-tenured teachers. With respect to
age, a teacher's ratings seem to peak during the 31-to-40
age period, with ratings rising steadily to that point and 
then declining only slightly and very slowly afterwards. 
Thus, variables such as age, sex, tenure status, and class 
size might be good predictors of student ratings. 

Summary 

A careful scrutiny of the literature reveals that very
few multivariate studies have been conducted in the area of
this problem, and that several criticisms (mentioned in
chapter I) may be leveled at most of the simpler, but more
numerous, correlational studies. Furthermore, there is
disagreement among the researchers, and directly conflicting
results from their studies have been confounded by the use
of many different research methods, different rating
instruments, and different sample characteristics (including
different units of analysis). In sum, the research question
has not been answered satisfactorily, though several
interesting concepts and theories have emerged which could
possibly prove helpful, once better evidence is obtained.



CHAPTER III 
PROCEDURE 

This chapter describes the subjects, the rating
instrument, the predictor variables, the criterion variables,
and the statistical methods used in this study. Appearing
in the appendix are macro flowcharts of the computer
programming steps required to assemble the data for analysis.

Subjects 

The "subjects" in this study should be considered to
be the 2,360 course sections for which complete data could
be assembled. Since only 89 course sections were deleted
for missing data, this study covered 96.37% of the entire 
population of 2,449 course sections evaluated following 
the spring semester of 1973 at the University of Connecticut. 
Over 30,000 rating forms were returned, representing about 
55% of the students who were sent them. 
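
The case accounting in this paragraph can be checked with a few lines of arithmetic. The sketch below is illustrative only: the per-variable deletion counts are those reported under the individual variable descriptions later in this chapter, and it assumes (as their sum of 89 suggests) that no course section was counted under more than one deletion reason. The variable names are hypothetical.

```python
# Course sections evaluated after the spring 1973 semester (from the text).
POPULATION = 2449

# Deletions for missing data, as reported for each variable in this chapter.
# Assumption: the four groups of deleted course sections do not overlap.
deleted = {
    "title level": 2,
    "department quantitativeness": 56,
    "average grade": 27,
    "Item 8": 4,
}

total_deleted = sum(deleted.values())
analyzed = POPULATION - total_deleted
coverage_pct = 100 * analyzed / POPULATION

print(total_deleted, analyzed, round(coverage_pct, 2))
```

Under these assumptions the totals agree with the figures quoted above (89 deletions, 2,360 course sections retained).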

Instrument 

The University of Connecticut Rating Scale for
Instruction (UCRSI; Bureau of Institutional Research,
University of Connecticut, 1971) is presented in Figure 4.

[Figure 4. University of Connecticut Rating Scale for Instruction]

The UCRSI is the instrument currently used at the university
in a program of faculty evaluation that had its origin in
the late 1940's. It started with a University Senate
concern with the quality of teaching. It was felt that
there might be an overemphasis on the "publish or perish"
ethic.

At first, the university evaluated only certain "target" 
teachers (those for whom the administration requested 
information, usually for career decisions). But by the late
1960's, the process had grown into one that covered every
faculty member teaching a "ratable" course section. The
nonratable course sections are those deemed not ratable in
the normal way, such as seminars, independent study courses,
field work, practice teaching, team-taught courses, graduate
assistant-taught courses, and others. The decision as to 
whether or not a specific course section is ratable is made 
by the individual department chairmen each year. 

After students have received their grades for the 
spring semester, and after they have moved to their summer 
addresses, the rating forms are mailed to the students with 
their teachers' names and their course sections already 
filled in (see Figure 4). The university pays for postage
both ways, and the anonymity of the students is guaranteed
since students' names or identification numbers appear 
nowhere on returned forms. The rate of return of ratings 
has remained around 55% over the recent years. 

Ratings are done only after each spring semester. This
is done for two reasons: (1) so that students will fill out
rating forms at their homes in the summer away from the 
influences of their fellow students, and (2) especially so 
that the cost of evaluation will not be doubled (by 
repeating the evaluations each semester). 

Results are machine tabulated and sent both to the
individual teachers and to their department chairmen.
Cumulative average scores (cumulative since 1969) accompany
each teacher's ratings for the recently ended course
sections. Since the Bureau of Institutional Research edits
each of the incoming rating forms by hand (including
checking for the clarity of markings), and since invalid
forms are removed, high accuracy is claimed for the ratings
data.

The ratings data are provided to promotion and tenure 
committees by the department chairmen, along with other 
pertinent information. There has been some concern at the 
university that someone might attribute significance to very
small differences in ratings, when in fact, only extreme
differences in ratings have any practical meaning. There
is also some concern about the validity of the UCRSI. There
is no certainty as to what it measures, although university
officials have noted that the ratings seem to have a
moderately high "reliability." That is, large samples of 
student ratings show a high degree of agreement with each 
other. What the students are agreeing about, however, is 
not absolutely clear. 

Initial Predictor Variables

The initial battery of predictors of student ratings
of college teachers included twelve variables. Presented
here are brief descriptions of each of these predictors and
of how each was derived.

Sex. The sex of each teacher was obtained from his or
her professional history, filed with the administration.
No cases were lost for missing data for this variable. The
characters M and F, for male and female, were converted to
values 1 and 2 respectively before computation began.

Appointment length. The number of months of each
teacher's appointment was also obtained from the
professional history. In most cases, teachers are appointed
each year for a term of 9 months. A few have 10-month
appointments, and several are for 11 months. The values 9,
10, and 11 were used for this variable, and no cases were
lost for missing data.

Percentage employed. Most teachers at the University
of Connecticut are employed full-time, regardless of their
appointment length. Several are part-time, however, and
the percentages vary. Values from 1 to 100, representing
percentages of full-time employment, were obtained from the
professional history, and no cases were lost for missing
data.

Class size. The number of students enrolled in each
course section at the end of the semester was obtained from
the grade distribution data. It equals the total number of
all grades given, including incompletes, etc. No cases were
lost for missing data for this variable.

Number of years tenured. The year that a teacher was
tenured (if he was) was obtained from the professional file.
The number of years since being tenured was computed (zero
if not tenured) by subtracting the tenure year from 1973.
No data were missing.

Age. The birthdate of each teacher was obtained from
the professional history file, and each teacher's
chronological age in completed years was computed as of 
May 8, 1973 (the end of the spring semester). 

Years employed at the University of Connecticut. The
hiring date of each teacher was also obtained from the
professional history. The number of years of continuous
service at the university was computed as of May 8, 1973
(the end of the spring semester) for each case. None of the
cases were lost on account of missing data for this variable.
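
The three date-derived predictors just described (number of years tenured, age, and years employed) might be computed along the following lines. This is a minimal sketch and not the study's original program: the function names and the sample dates are hypothetical, and counting years of service in completed years (as is stated for age) is an assumption, since the text does not specify the rounding used for that variable.

```python
from datetime import date

REFERENCE_DATE = date(1973, 5, 8)  # end of the spring semester, per the text
REFERENCE_YEAR = 1973


def completed_years(start, as_of=REFERENCE_DATE):
    """Whole years elapsed between start and as_of."""
    years = as_of.year - start.year
    # Subtract one if the anniversary has not yet occurred in the as_of year.
    if (as_of.month, as_of.day) < (start.month, start.day):
        years -= 1
    return years


def years_tenured(tenure_year):
    """Zero if not tenured (None); otherwise 1973 minus the tenure year."""
    return 0 if tenure_year is None else REFERENCE_YEAR - tenure_year


# Hypothetical teacher record, for illustration only.
age = completed_years(date(1935, 9, 1))       # birthday not yet reached in 1973
service = completed_years(date(1965, 1, 15))  # hiring anniversary already passed
tenure = years_tenured(1970)
untenured = years_tenured(None)
```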

Course level. Course numbers were obtained from the
rating responses records. The level of the course was then
computed by dividing the course number by 100 and dropping
the digits to the right of the decimal point. The resulting
values (0, 1, 2, 3, and 4) represent courses with numbers
1-99, 100-199, 200-299, 300-399, and 400-499 respectively.
No cases were lost for missing data for this variable.
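
The course-level rule amounts to integer division by 100. A brief sketch follows; the function name and the sample course numbers are hypothetical.

```python
def course_level(course_number):
    """Divide the course number by 100 and drop the fractional part."""
    return course_number // 100


# Hypothetical course numbers spanning the five ranges named in the text.
levels = [course_level(n) for n in (99, 100, 257, 399, 400)]
```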

Title level. Each teacher's classification code was
obtained from the professional file. A list of the
classification codes belonging to each title level
(Instructor, Assistant Professor, Associate Professor, and
Professor) was created with the aid of personnel from the
Bureau of Institutional Research for use in one of the
merging computer programs. Title levels (values 1, 2, 3,
and 4 for Instructor, Assistant Professor, Associate
Professor, and Professor respectively) were assigned to
each teacher according to his or her classification code.
Two course sections were lost since one teacher's
classification code could not be correctly grouped with any
one level.

Course location. A branch code was obtained for each 
course section from the rating responses data. A value of 
1 was assigned to the course location if the course was
taught at any of the branches (Hartford, Stamford,
Southeastern, or Hartford M.B.A.), and a value of 2 was
assigned to course sections taught at the Storrs main
campus. No cases were lost for missing data for this
variable.

Department quantitativeness. Student records for
juniors and seniors (majors are not declared earlier)
included each student's major department and two Scholastic
Aptitude Test scores (Verbal and Quantitative). For each
department in which students declared a major, the average
quantitativeness (DQ_j) was computed as the average
difference in Quantitative and Verbal SAT scores of the
students majoring in that department (j). The formula
used to compute the average quantitativeness for each
department was the following:

    DQ_j = \sum_{i=1}^{n_j} (SATQ_{ij} - SATV_{ij}) / n_j        (1)

where, in the jth department, SATQ_{ij} is the ith student's
Quantitative SAT score, SATV_{ij} is the ith student's Verbal
SAT score, and n_j is the number of students majoring in that
department. The results of these computations are presented
in Table 4. Since some departments had very few students
as majors, the quantitativeness figures for those
departments should be interpreted cautiously. With small
n's, those departments may have radically different
quantitativeness figures from one year to the next.
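
Formula (1) is a per-department mean of the Quantitative-minus-Verbal difference. A minimal sketch of that computation is given below; the function name, the record layout, and the sample scores are hypothetical and are not taken from the study's data.

```python
from collections import defaultdict


def department_quantitativeness(records):
    """Return {department: (DQ_j, n_j)} from (department, SATQ, SATV) records."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for dept, satq, satv in records:
        sums[dept] += satq - satv  # Quantitative minus Verbal, as in formula (1)
        counts[dept] += 1
    return {dept: (sums[dept] / counts[dept], counts[dept]) for dept in sums}


# Hypothetical student records, for illustration only.
students = [
    ("Dept A", 650, 500),
    ("Dept A", 600, 550),
    ("Dept B", 480, 560),
]
dq = department_quantitativeness(students)
```

Returning n_j alongside DQ_j keeps the caution about small departments visible: a value based on one or two majors can be flagged before it is interpreted.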

Note also that departments high in quantitativeness
have high positive values (maximum of 169.5), and
departments low in quantitativeness have low or negative
values (minimum of -135.0). Being very high or very low
(or anywhere between for that matter) on this scale of
quantitativeness does not signify anything about the quality
of the department or its average SAT scores. One could
argue either way that it is better to be highly "verbal"
or highly "quantitative." Yet for suggesting a profile,
the scale appears to be very meaningful; it has high face
validity.

There are 10 departments in which no undergraduate
students may major (Aerospace R.O.T.C., Biobehavioral
Science, (general) Engineering, Interdepartmental,
Journalism, Linguistics, Metallurgy, Polish, Portuguese, and
Science). The 56 course sections that were taught in these
10 departments were excluded from this study because of
missing data for department quantitativeness.

Table 4

Mean Differences Between Quantitative and Verbal SAT Scores
of Undergraduates with Different Majors

Department Name                             Mean      n    S.D.

Statistics                                169.50      4    49.10
Civil Engineering                         119.99    101    73.10
Mathematics                               106.04    144    87.04
Chemical Engineering                      105.42     26    80.99
Mechanical Engineering                     95.89     35    71.18
Electrical Engineering                     94.64     78    96.26
Accounting                                 82.23    140    81.77
Pharmacy                                   80.93     45    84.94
Finance                                    71.60    102    84.64
Business                                   68.06    190    87.95
Physics                                    64.68     19    70.33
Italian                                    64.50      2    46.50
Geography                                  62.25     12    70.35
Physical Education                         61.47     87    72.[illegible]
Geology                                    60.83     24    87.42
Chemistry                                  59.00     43    82.62
Agricultural Engineering                   58.50      8    84.80
Industrial Administration                  57.55     60    86.65
Agricultural Economics                     56.33      3    42.87
Marketing                                  52.44     94    81.47
Pre-Veterinary                             49.00     13    89.70
Nutritional Science                        43.00      7    76.93
Biology                                    41.86    345    92.16
Horticulture                               41.07     99    82.83
Animal Industries                          39.98     49    85.01
Economics                                  31.99    101    77.50
Agriculture and Natural Resources          29.00      1     0
Spanish                                    22.82     38    94.36
Physical Therapy                           19.25    181    90.33
Political Science                          18.85    189    84.80
Foods & Nutrition                          17.86     22    69.38
Family Economics                           17.57      7    76.56
German                                     16.57     14   119.31
Clothing, Textiles, & Interior Design      15.79  [illegible]  [illegible]
Speech                                     14.66    110    88.50
Child Development & Family Relations       14.10    174    81.78
Psychology                                 13.49  [illegible]  [illegible]
History                                    13.37  [illegible]  [illegible]
Music                                      10.56  [illegible]  [illegible]
Medical Technology                         10.29     35    84.86
Sociology                                  10.16    225    89.27
Education                                   9.60    354    81.45
Art                                         6.11     82    77.10
Philosophy                                 -3.46     33    79.87
Nursing                                    -4.24    193    76.47
Anthropology                               -5.30     57    86.03
French                                    -17.50     47    73.60
Dramatic Arts                             -23.02     47   104.45
Russian                                   -29.71      7   127.18
English                                   -33.09    401    88.84
Classics                                 -135.00      1     0

Rate of return of ratings. For each course section, 
the percentage of rating forms returned by the students 
was computed. The number of ratings returned (obtained 
from the ratings data) was divided by the enrollment in 
the class (see class size). Values for rates of return 
were permitted to range from .001 to 1.000, and no cases 
were dropped on account of missing data. 

Grading Variables 

Average grade. The number of each type of grade given
in each course section was obtained from the grade
distribution data (from the registrar's records). Grades
such as "I" for Incomplete and "P" for Pass were ignored,
and the average of the "A-through-F" grades was computed
for each course section (values permitted from 0.00 to
4.00). There were 27 cases deleted on account of missing
data for this variable. In these 27 course sections, no
grades were given in the "A-through-F" range. Note that
these course sections differ from those in which all "F's"
were given (resulting in an average grade of 0.00), and
thus the 27 cases could not logically be included in the
study with any particular value for an average grade.

Standard deviation of grades, Variance of grades,
Skewness of grades, and Kurtosis of grades. These
characteristics of the distribution of the "A-through-F"
grades in each course section were computed at the same
time as the average grade, using the registrar's grade
distribution data. No further cases (beyond the 27 cases
lost for average grade missing data) were lost for missing
data for these variables.
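
The five grading variables can be illustrated with a short computation from a course section's grade counts. The sketch below rests on several assumptions that the text does not state: grade points of A = 4 through F = 0 with no plus or minus grades, population (divide-by-n) moments, and kurtosis expressed as the fourth standardized moment without subtracting 3. The function name and the sample counts are hypothetical.

```python
import math

# Assumed grade-point values; the text gives only the 0.00 to 4.00 range.
GRADE_POINTS = {"A": 4.0, "B": 3.0, "C": 2.0, "D": 1.0, "F": 0.0}


def grade_distribution_stats(counts):
    """Moments of the A-through-F grades; other grades (I, P, ...) are ignored."""
    pairs = [(GRADE_POINTS[g], k) for g, k in counts.items()
             if g in GRADE_POINTS and k > 0]
    n = sum(k for _, k in pairs)
    if n == 0:
        return None  # no A-through-F grades: the case is treated as missing

    mean = sum(p * k for p, k in pairs) / n

    def central_moment(order):
        return sum(k * (p - mean) ** order for p, k in pairs) / n

    variance = central_moment(2)
    sd = math.sqrt(variance)
    if variance == 0:
        skewness = kurtosis = None  # undefined when every grade is identical
    else:
        skewness = central_moment(3) / sd ** 3
        kurtosis = central_moment(4) / variance ** 2
    return {"n": n, "mean": mean, "variance": variance, "sd": sd,
            "skewness": skewness, "kurtosis": kurtosis}


# Hypothetical course sections, for illustration only.
stats = grade_distribution_stats({"A": 2, "B": 4, "C": 2, "I": 1, "P": 3})
all_f = grade_distribution_stats({"F": 3})
no_grades = grade_distribution_stats({"P": 5, "I": 2})
```

The last two calls mirror the distinction drawn in the text: a section of all F's has a defined average of 0.00, whereas a section with no A-through-F grades has no average at all.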

Criterion Variables 

Ten potential criterion variables were computed for 
each course section using the rating responses data. The 
process of selecting the criteria used in the stepwise 
multiple regression analyses is explained in the next 
section and in the next chapter. Descriptions of the 10 
potential criteria and how they were derived are presented 
here. 

Items 1-8. The average rating on each of the eight
items on the University of Connecticut Rating Scale for
Instruction (UCRSI; see Figure 4) was computed for each
course section. Four course sections (in which no one 
answered Item 8) had to be deleted for missing data. 

Average of Items 1 through 8. The average of the
responses to all of the items on the UCRSI was computed for 
each course section using the ratings data. No cases were 
dropped on account of missing data for this variable (none 
beyond the four dropped for Item 8 missing data).

Average of Items 1 through 7. The average of the first
seven items on the UCRSI was also computed from the rating 
responses data, and there were no missing data. 

Statistical Analyses 

In order to determine the number and nature of 
components or dimensions underlying the rating instrument
items, two principal components factor analyses (of 1972 and
1973 ratings data) were conducted prior to the main effort
of this study. The results of these (details presented in
chapter IV) were instrumental when it came to making
decisions about the choice of a criterion variable.
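
To make concrete what the first principal component of a set of item intercorrelations provides, the sketch below extracts the largest eigenvalue and the corresponding loadings from a small correlation matrix by power iteration. It is an illustration under stated assumptions, not the program used in the study: the three-item matrix is hypothetical, and power iteration is simply a convenient way to obtain the first component without a numerical library.

```python
import math


def first_principal_component(corr, iterations=200):
    """Largest eigenvalue, loadings, and percentage of variance of a correlation matrix."""
    n = len(corr)
    # Equal starting weights; adequate when all correlations are positive.
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(iterations):
        w = [sum(corr[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient v'Cv gives the eigenvalue for the unit vector v.
    cv = [sum(corr[i][j] * v[j] for j in range(n)) for i in range(n)]
    eigenvalue = sum(v[i] * cv[i] for i in range(n))
    loadings = [x * math.sqrt(eigenvalue) for x in v]
    pct_variance = 100.0 * eigenvalue / n
    return eigenvalue, loadings, pct_variance


# Hypothetical three-item correlation matrix, for illustration only.
toy_corr = [
    [1.0, 0.6, 0.6],
    [0.6, 1.0, 0.6],
    [0.6, 0.6, 1.0],
]
eigenvalue, loadings, pct_variance = first_principal_component(toy_corr)
```

The percentage of variance is the eigenvalue divided by the number of items. For an eight-item instrument such as the UCRSI, an eigenvalue near 6.0 therefore corresponds to roughly 75 percent of the total variance.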

One stepwise multiple regression analysis was employed 
to reduce the initial battery of 12 predictor variables
(predicting the average of Items 1-8 on the UCRSI) to an
"optimally reduced subset" (see chapter I). The results of
this regression analysis are presented in chapter IV and 
discussed in chapter V. As far as the procedure is 
concerned, this regression analysis provided a multiple 
correlation and an ordered, optimally reduced subset of 
predictors, both of which were needed as input to following 
steps. 

A second stepwise multiple regression analysis was
performed using the optimally reduced subset found in the
first regression analysis and the five grading variables
(see chapter I) to predict the same criterion (average of
Items 1-8). First, the variables in the optimally reduced
subset entered this regression equation in the same order
that they entered the prior regression equation (i.e.,
ordered). Then, the grading variables were allowed to enter
the regression equation in the order of their ability to
improve the equation (i.e., floating). This served to test
the incremental importance of the grading variables as
predictors of ratings, given that certain other variables 
were already in the regression equation. Thus, the 
variables in the optimally reduced subset, in effect, 
"partialled out" a chunk of the variance before the grading 
variables were even considered. 
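
The "ordered, then floating" entry scheme described above can be sketched as follows. This is a simplified illustration and not the stepwise program used in the study: it enters every floating variable in order of its gain in R squared and omits the F-to-enter and F-to-remove criteria of a full stepwise procedure. The function names, variable names, and data are hypothetical.

```python
def r_squared(columns, y):
    """R squared from a least-squares fit of y on the columns plus an intercept."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(X[0])
    # Augmented normal equations [X'X | X'y], solved by Gaussian elimination.
    A = [
        [sum(X[r][i] * X[r][j] for r in range(n)) for j in range(k)]
        + [sum(X[r][i] * y[r] for r in range(n))]
        for i in range(k)
    ]
    for c in range(k):
        pivot = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[pivot] = A[pivot], A[c]
        for r in range(c + 1, k):
            factor = A[r][c] / A[c][c]
            for j in range(c, k + 1):
                A[r][j] -= factor * A[c][j]
    b = [0.0] * k
    for i in reversed(range(k)):
        b[i] = (A[i][k] - sum(A[i][j] * b[j] for j in range(i + 1, k))) / A[i][i]
    mean_y = sum(y) / n
    ss_total = sum((v - mean_y) ** 2 for v in y)
    ss_resid = sum((y[r] - sum(X[r][j] * b[j] for j in range(k))) ** 2
                   for r in range(n))
    return 1.0 - ss_resid / ss_total


def forced_then_floating(forced, floating, y):
    """Enter forced predictors in the given order, then floating ones by R squared gain."""
    entered, history = [], []
    for item in forced:
        entered.append(item)
        history.append((item[0], r_squared([c for _, c in entered], y)))
    remaining = list(floating)
    while remaining:
        best = max(remaining,
                   key=lambda item: r_squared([c for _, c in entered] + [item[1]], y))
        remaining.remove(best)
        entered.append(best)
        history.append((best[0], r_squared([c for _, c in entered], y)))
    return history


# Hypothetical data: x1 stands in for the reduced subset, g1 and g2 for grading variables.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
g1 = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
g2 = [1.0, 1.0, -1.0, -1.0, 1.0, 1.0, -1.0, -1.0]
y = [2.0 * a + 3.0 * b + 0.5 * c for a, b, c in zip(x1, g1, g2)]
history = forced_then_floating([("x1", x1)], [("g2", g2), ("g1", g1)], y)
```

The successive R squared values in `history` show how much each floating variable adds once the forced variables are already in the equation, which is the sense in which the reduced subset "partials out" variance before the grading variables are considered.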

The results of the second regression analysis also
determined the relative importance of the grading variables
and the overall ability of the available predictor variables 
to predict ratings. These results are also presented in 
chapter IV and discussed in chapter V. 

Cross-validation of the multiple regression analyses
was simulated through an examination of the estimated amount
of shrinkage in the multiple correlations using McNemar's
(1962, p. 184) shrinkage formula:

    R' = \{ 1 - (1 - R^2) [ (N - 1) / (N - n) ] \}^{1/2}        (2)

where R' is the multiple correlation after shrinkage, N is
the sample size, n is the number of predictor variables, and
R^2 is the multiple correlation squared.
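
Formula (2) can be evaluated directly. In the sketch below the function name is hypothetical, and the multiple correlation of .30 is an illustrative value rather than a result of this study; the sample size of 2,360 and the 12 predictors are the figures given in this chapter.

```python
import math


def shrunken_r(r, n_cases, n_predictors):
    """Multiple correlation after shrinkage, following formula (2)."""
    r2_shrunk = 1.0 - (1.0 - r ** 2) * (n_cases - 1) / (n_cases - n_predictors)
    # Guard against a negative value under the square root for very small R.
    return math.sqrt(max(r2_shrunk, 0.0))


# Illustrative R of .30 with N = 2,360 course sections and 12 predictors.
example = shrunken_r(0.30, 2360, 12)

# With a single predictor the correction factor (N - 1)/(N - n) equals 1.
no_shrinkage = shrunken_r(0.5, 101, 1)
```

With a sample this large relative to the number of predictors, the factor (N - 1)/(N - n) is close to 1, so the estimated shrinkage is small.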

In order to test the significance of the increase in
the multiple correlation when the grading variables were
added to the optimal subset of other variables, an F-test
of significance was performed using another of McNemar's
(1962, p. 284) formulae:

    F = \frac{ (R_1^2 - R_2^2) / (m_1 - m_2) }{ (1 - R_1^2) / (N - m_1 - 1) }        (3)

where R_1^2 is the larger multiple correlation squared, R_2^2
is the smaller multiple correlation squared, N is the sample
size, m_1 is the number of predictor variables associated
with R_1, and m_2 is the number of predictor variables
associated with R_2, with degrees of freedom m_1 - m_2 and
N - m_1 - 1.
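
A small function makes the terms of formula (3) explicit. The function name and the numerical values below are hypothetical and are chosen only so that the arithmetic is easy to follow; they are not results of this study.

```python
def f_increment(r2_full, r2_reduced, n_cases, m_full, m_reduced):
    """F ratio and degrees of freedom for an increase in R squared, per formula (3)."""
    df1 = m_full - m_reduced
    df2 = n_cases - m_full - 1
    f_ratio = ((r2_full - r2_reduced) / df1) / ((1.0 - r2_full) / df2)
    return f_ratio, df1, df2


# Illustrative values: R squared rises from .40 to .50 when two predictors are
# added to three others, with N = 106 cases.
f_ratio, df1, df2 = f_increment(0.50, 0.40, 106, 5, 3)
```

In this illustration the numerator is .10 / 2 = .05 and the denominator is .50 / 100 = .005, giving F = 10 with 2 and 100 degrees of freedom.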

The significance of the multiple correlations by
themselves was determined with F-tests of significance using
yet another of McNemar's (1962, p. 283) formulae:

    F = (R^2 / m) / [ (1 - R^2) / (N - m - 1) ]        (4)

where R^2 is the multiple correlation squared, m is the
number of predictor variables included in the multiple
correlation, and N is the sample size, with degrees of
freedom m and N - m - 1.
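
Formula (4) can be sketched in the same way. Again the function name and the numbers are hypothetical and serve only to show the arithmetic.

```python
def f_multiple_r(r2, n_cases, m):
    """F ratio and degrees of freedom for a multiple correlation, per formula (4)."""
    df1 = m
    df2 = n_cases - m - 1
    f_ratio = (r2 / m) / ((1.0 - r2) / df2)
    return f_ratio, df1, df2


# Illustrative values: R squared of .20 from four predictors and N = 101 cases.
f_overall, df_num, df_den = f_multiple_r(0.20, 101, 4)
```

Here the numerator is .20 / 4 = .05 and the denominator is .80 / 96, giving F = 6 with 4 and 96 degrees of freedom.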

The significance of the individual simple correlations 
was determined using a significance table for correlation 
coefficients.

Summary

Student ratings of college teachers at the University
of Connecticut during the spring 1973 semester were studied
to determine whether or not the addition of five new
predictor variables dealing with grades could significantly
improve an optimal set of predictors reduced from an initial
battery of predictors. Factor analysis was used to reduce
the eight-item rating instrument to a single criterion
variable. Stepwise multiple regression analysis was used,
both to reduce the initial battery of predictors to an
optimally reduced subset, and to test the incremental
importance of the grading variables as predictors of
average ratings.



CHAPTER IV
RESULTS

This chapter presents the results of the statistical 
analyses of this study. The procedures used to obtain
these results are explained in chapter III, and a discussion
of these results is provided in chapter V.

Factor Analyses of the Rating Instrument's Items 

Principal components factor analyses of two separate 
years' ratings (1972 and 1973) yielded nearly identical 
results. These results, presented in Table 5, show that 
all eight items on the University of Connecticut Rating 
Scale for Instruction (UCRSI) loaded heavily (.778 or 
greater) on a single factor. Item 8, "Over-All Summary As 
A Teacher," had a factor loading of over .97 both times. 
Furthermore, the correlations among the eight items,
presented in Table 6, were all +.52 or greater and highly
significant (p < .000001). It should be noted that these
correlations are among course section averages.
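
The percentage of variance reported with each eigenvalue in Table 5 follows from a general property of principal components computed on a correlation matrix: each standardized item contributes one unit of variance, so a factor's share is its eigenvalue divided by the number of items. The short sketch below illustrates that arithmetic; the function name `percent_variance` is an assumption introduced here.

```python
def percent_variance(eigenvalue, n_items):
    """Share of total variance (in percent) carried by one principal component
    of a correlation matrix built from n_items standardized items."""
    if n_items <= 0:
        raise ValueError("n_items must be positive")
    return 100.0 * eigenvalue / n_items


if __name__ == "__main__":
    # Eigenvalues as printed in Table 5; the rating instrument has eight items.
    for year, eigenvalue in ((1972, 6.030), (1973, 6.011)):
        print(year, percent_variance(eigenvalue, 8))
```

The small differences from the percentages printed in Table 5 (75.372 and 75.139) are what one would expect from the eigenvalues having been rounded to three decimals.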

The results of these preliminary factor analyses 
indicated that there was essentially one, global dimension 
underlying the ratings data, and that a single criterion 

Table 5

Factor Analyses of the Eight Items on the
University of Connecticut Rating Scale for Instruction

                                        1972         1973
                                      Factor 1     Factor 1

N                                       2417         2465
Eigenvalue                             6.030        6.011
Percentage of Variance                75.372       75.139

Items:                                 Factor       Factor
                                      Loadings:    Loadings:

1. Knowledge of Subject                 .778         .783
2. Presentation of Material             .889         .895
3. Balance of Breadth & Detail          .887         .902
4. Enthusiasm for Subject               .852         .832
5. Fairness in Marking                  .836         .824
6. Attitude Toward Student              .853         .852
7. Personal Mannerisms                  .867         .863
8. Over-All Summary as a Teacher        .970         .971

Table 6

Product-moment Correlations Among the Eight Items
on the University of Connecticut
Rating Scale for Instruction for 1972 and 1973

Items:                       1         2         3         4         5         6         7

1. Knowledge of
   Subject

2. Presentation             .66
   of Material             (.66)

3. Balance of               .69       .88
   Breadth & Detail        (.66)     (.87)

4. Enthusiasm               .74       .70    [illeg.]
   for Subject             (.72)     (.70)     (.68)

5. Fairness in              .55       .63       .68       .58
   Marking                 (.56)     (.64)     (.68)     (.64)

6. Attitude Toward          .52       .66       .66       .66       .81
   Student                 (.54)     (.65)     (.64)     (.70)     (.81)

7. Personal                 .55       .73       .74       .63       .71       .78
   Mannerisms              (.56)     (.74)     (.74)     (.67)     (.70)     (.77)

8. Over-All Summary         .74       .90       .89       .79       .77       .81       .82
   as a Teacher            (.74)     (.89)     (.87)     (.81)     (.77)     (.81)     (.83)

Note. 1972 correlations are in parentheses; N's for 1972
and 1973 were 2417 and 2465 respectively; all
correlations are significant, p < .000001.

variable representing that factor could be employed in the
subsequent stepwise multiple regression analyses. The
average of all eight items was selected as the primary
criterion for this study, but results were also obtained
for three other criteria considered possibly representative
of the factor (see below, Parallel Results Using Other
Criteria). One of these other criteria, Item 8 by itself,
was chosen on account of its extremely high factor loading.
The average of the other seven items was chosen as a
criterion for purposes of comparison with the first two
criteria, since all eight items had very high factor
loadings. Item 5 by itself was also used as a criterion in
order to determine how well the predictor variables could
predict the students' ratings of "grading fairness."

Reduction of the Initial Battery of Predictor Variables

The means and standard deviations of the 27 predictor
and criterion variables are presented in Table 7. They
were computed as part of this first stepwise multiple
regression analysis. It should be noted that these means
and standard deviations were computed across the 2,360
course sections and, therefore, that teachers are unequally
represented according to the number of course sections they
taught.

Table 8 shows the correlations among the 27 predictor
and criterion variables. These are the correlations that
were computed by the stepwise multiple regression analysis

Table 7

Means and Standard Deviations
of the Predictor and Criterion Variables

     Variable                            Mean       S.D.

 1   Sex                                 1.15        .36
 2   Appointment length                  9.04        .28
 3   Percentage employed                98.73       7.68
 4   Class size                         22.48      28.34
 5   Number of years tenured             4.69       6.99
 6   Age                                41.31      10.10
 7   Years since hiring                  7.59       7.45
 8   Course level                        1.79        .79
 9   Title level                         2.65       1.00
10   Course location                     1.80        .40
11   Department quantitativeness        33.75      43.35
12   Rate of return of ratings            .57        .16
13   Average grade                       2.94        .57
14   Standard deviation of grades         .72        .32
15   Variance of grades                   .61        .44
16   Skewness of grades                  2.60       2.81
17   Kurtosis of grades                 13.62      29.34
18   Item 1                              8.1?       1.13
19   Item 2                              6.91       1.60
20   Item 3                              6.87       1.47
21   Item 4                              8.01       1.29
22   Item 5                              7.61       1.29
23   Item 6                              7.54       1.46
24   Item 7                              7.52       1.38
25   Item 8                              7.33       1.45
26   Average of Items 1-8                7.49       1.20
27   Average of Items 1-7                7.51       1.1?

Note. N = 2,360 for each variable.



Table 8

Product-moment Correlations
Among the Predictor and Criterion Variables

    Variable                       1         2         3         4         5         6

 1  Sex                         1.00      -.03      -.07       .00  [illeg.]       .08
 2  Appointment length          -.03      1.00      -.10       .00      -.02       .00
 3  Percentage employed         -.07      -.10      1.00      -.01       .07       .01
 4  Class size                   .00       .00      -.01      1.00      -.04  [illeg.]
 5  No. of years tenured    [illeg.]      -.02       .07      -.04      1.00  [illeg.]
 6  Age                          .08       .00       .01  [illeg.]  [illeg.]      1.00
 7  Years since hiring      [illeg.]       .00       .09  [illeg.]  [illeg.]       .70
 8  Course level                -.12       .00      -.01      -.14       .05       .08
 9  Title level                 -.20      -.03       .11      -.04       .62       .58
10  Course location             -.22       .06      -.01       .08       .15      -.03
11  Dept. quant.                -.09      -.02       .01      -.04       .08  [illeg.]
12  Rate of return               .05      -.01      -.02      -.06  [illeg.]       .02
13  Average grade               -.04       .08      -.01      -.16       .03       .03
14  Std. dev. of grades          .04      -.05      -.01       .20      -.03      -.07
15  Variance of grades           .04      -.05       .01       .14      -.03      -.07
16  Skewness of grades           .02      -.03       .01       .51      -.03      -.09
17  Kurtosis of grades           .02      -.02       .01       .77      -.03      -.07
18  Item 1                      -.00      -.02       .08      -.06       .15       .16
19  Item 2                       .05       .01       .05      -.02       .03      -.04
20  Item 3                       .05       .00       .05      -.04      -.00      -.06
21  Item 4                       .06       .01       .07      -.05       .05       .07
22  Item 5                      -.02       .02       .01      -.06      -.03      -.04
23  Item 6                       .02       .03       .03      -.07       .01       .00
24  Item 7                       .06       .02       .05      -.06      -.03      -.07
25  Item 8                       .02       .01       .06      -.05       .03      -.00
26  Average of Items 1-8         .04       .01       .06      -.06       .03      -.00
27  Average of Items 1-7         .04       .01       .06      -.06       .02      -.00

Note. All r's > .04 (or < -.04) are significant, p < .05.
For r's > .10 (or < -.10), p < .000001.

Table 8 (Continued)

    Variable                       7         8         9        10        11        12

 1  Sex                     [illeg.]      -.12      -.20      -.22      -.09       .05
 2  Appointment length           .00       .00      -.03       .06      -.02      -.01
 3  Percentage employed          .09      -.01       .11      -.01       .01      -.02
 4  Class size              [illeg.]      -.14      -.04       .08      -.04      -.06
 5  No. of years tenured    [illeg.]       .05       .62       .15       .08  [illeg.]
 6  Age                          .70       .08       .58      -.03  [illeg.]       .02
 7  Years since hiring          1.00  [illeg.]       .61       .10       .07       .00
 8  Course level            [illeg.]      1.00       .26       .34      -.01       .05
 9  Title level                  .61       .26      1.00       .30       .06       .01
10  Course location              .10       .34       .30      1.00       .04      -.04
11  Dept. quant.                 .07      -.01       .06       .04      1.00       .06
12  Rate of return               .00       .05       .01      -.04       .06      1.00
13  Average grade                .02       .56       .15       .31      -.15       .03
14  Std. dev. of grades         -.03      -.52      -.15      -.17       .15      -.07
15  Variance of grades          -.03      -.47      -.14      -.17       .19      -.04
16  Skewness of grades          -.04      -.38      -.11      -.09       .18      -.04
17  Kurtosis of grades          -.03      -.21      -.05      -.01       .10      -.02
18  Item 1                       .16       .11       .28       .02      -.01       .05
19  Item 2                       .03       .15       .09       .06      -.09       .01
20  Item 3                       .00       .16       .08       .04      -.03       .02
21  Item 4                       .06       .14       .14       .01      -.12       .00
22  Item 5                      -.03       .16       .06       .08       .02      -.01
23  Item 6                       .01       .19       .05       .10      -.04      -.03
24  Item 7                      -.03       .18       .06       .08      -.04       .01
25  Item 8                       .04       .15       .12       .06      -.04       .01
26  Average of Items 1-8         .03       .18       .12       .07      -.05       .01
27  Average of Items 1-7         .03       .19       .12       .07      -.05       .01

Table 8 (Continued)

    Variable                      13        14        15        16        17        18

 1  Sex                         -.04       .04       .04       .02       .02      -.00
 2  Appointment length           .08      -.05      -.05      -.03      -.02      -.02
 3  Percentage employed         -.01      -.01       .01       .01       .01       .08
 4  Class size                  -.16       .20       .14       .51       .77      -.06
 5  No. of years tenured         .03      -.03      -.03      -.03      -.03       .15
 6  Age                          .03      -.07      -.07      -.09      -.07       .16
 7  Years since hiring           .02      -.03      -.03      -.04      -.03       .16
 8  Course level                 .56      -.52      -.47      -.38      -.21       .11
 9  Title level                  .15      -.15      -.14      -.11      -.05       .28
10  Course location              .31      -.17      -.17      -.09      -.01       .02
11  Dept. quant.                -.15       .15       .19       .18       .10      -.01
12  Rate of return               .03      -.07      -.04      -.04      -.02       .05
13  Average grade               1.00      -.68      -.65      -.56      -.33       .14
14  Std. dev. of grades         -.68      1.00       .94       .78       .44      -.10
15  Variance of grades          -.65       .94      1.00       .86       .49      -.08
16  Skewness of grades          -.56       .78       .86      1.00       .82      -.07
17  Kurtosis of grades          -.33       .44       .49       .82      1.00      -.04
18  Item 1                       .14      -.10      -.08      -.07      -.04      1.00
19  Item 2                       .29      -.17      -.16      -.12      -.06       .67
20  Item 3                       .27      -.16      -.15      -.11      -.06       .68
21  Item 4                       .23      -.19      -.18      -.16      -.09       .73
22  Item 5                       .41      -.20      -.17      -.14      -.10       .54
23  Item 6                       .43      -.23      -.23      -.20      -.13       .51
24  Item 7                       .32      -.18      -.17      -.14      -.08       .54
25  Item 8                       .32      -.19      -.18      -.15      -.09       .74
26  Average of Items 1-8         .35      -.21      -.19      -.16      -.09       .77
27  Average of Items 1-7         .36      -.21      -.19      -.16      -.09       .77

Table 8 (Continued)

    Variable                      19        20        21        22        23        24

 1  Sex                          .05       .05       .06      -.02       .02       .06
 2  Appointment length           .01       .00       .01       .02       .03       .02
 3  Percentage employed          .05       .05       .07       .01       .03       .05
 4  Class size                  -.02      -.04      -.05      -.06      -.07      -.06
 5  No. of years tenured         .03      -.00       .05      -.03       .01      -.03
 6  Age                         -.04      -.06       .07      -.04       .00      -.07
 7  Years since hiring           .03       .00       .06      -.03       .01      -.03
 8  Course level                 .15       .16       .14       .16       .19       .18
 9  Title level                  .09       .08       .14       .06       .05       .06
10  Course location              .06       .04       .01       .08       .10       .08
11  Dept. quant.                -.09      -.03      -.12       .02      -.04      -.04
12  Rate of return               .01       .02       .00      -.01      -.03       .01
13  Average grade                .29       .27       .23       .41       .43       .32
14  Std. dev. of grades         -.17      -.16      -.19      -.20      -.23      -.18
15  Variance of grades          -.16      -.15      -.18      -.17      -.23      -.17
16  Skewness of grades          -.12      -.11      -.16      -.14      -.20      -.14
17  Kurtosis of grades          -.06      -.06      -.09      -.10      -.13      -.08
18  Item 1                       .67       .68       .73       .54       .51       .54
19  Item 2                      1.00       .89       .70       .64       .66       .74
20  Item 3                       .89      1.00       .69       .68       .67       .74
21  Item 4                       .70       .69      1.00       .57       .65       .63
22  Item 5                       .64       .68       .57      1.00       .81       .71
23  Item 6                       .66       .67       .65       .81      1.00       .78
24  Item 7                       .74       .74       .63       .71       .78      1.00
25  Item 8                       .90       .89       .79       .78       .81       .83
26  Average of Items 1-8         .90       .91       .82       .83       .85       .87
27  Average of Items 1-7         .90       .90       .83       .83       .86       .87

Table 8 (Continued)

    Variable                      25        26        27

 1  Sex                          .02       .04       .04
 2  Appointment length           .01       .01       .01
 3  Percentage employed          .06       .06       .06
 4  Class size                  -.05      -.06      -.06
 5  No. of years tenured         .03       .03       .02
 6  Age                         -.00      -.00      -.00
 7  Years since hiring           .04       .03       .03
 8  Course level                 .15       .18       .19
 9  Title level                  .12       .12       .12
10  Course location              .06       .07       .07
11  Dept. quant.                -.04      -.05      -.05
12  Rate of return               .01       .01       .01
13  Average grade                .32       .35       .36
14  Std. dev. of grades         -.19      -.21      -.21
15  Variance of grades          -.18      -.19      -.19
16  Skewness of grades          -.15      -.16      -.16
17  Kurtosis of grades          -.09      -.09      -.09
18  Item 1                       .74       .77       .77
19  Item 2                       .90       .90       .90
20  Item 3                       .89       .91       .90
21  Item 4                       .79       .82       .83
22  Item 5                       .78       .83       .83
23  Item 6                       .81       .85       .86
24  Item 7                       .83       .87       .87
25  Item 8                      1.00       .97       .96
26  Average of Items 1-8         .97      1.00      1.00
27  Average of Items 1-7         .96      1.00      1.00

subprogram of the Statistical Package for the Social
Sciences (SPSS, Nie et al., 1970). These correlations
were subsequently used by that subprogram to perform the
regression analyses. It should be noted that these are,
again, correlations among course section averages. Also,
because more cases were deleted for missing data for the
regression analyses (more variables involved) than for the
factor analyses, some of the correlations in Table 6 differ
very slightly from the corresponding ones in Table 8.

Given the large sample size, any of these correlations
that exceeds .035 is significantly different from zero
(p < .05). Furthermore, for any r that exceeds .048, p is
less than .01; and if r exceeds .10, then p is less than
.000001. Thus one may be fairly certain that even the
relatively weak relationships found in this study were not
attributable to chance variation.
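
The cutoffs quoted above can be approximated from the usual t transformation of a correlation coefficient. The sketch below rests on two assumptions that the text does not state: that a normal approximation to the t distribution is adequate (reasonable with N = 2,360), and that the tests were one-tailed, which is the reading under which the computed values come out close to .035 and .048. The function name `critical_r` is introduced here for illustration.

```python
from math import sqrt
from statistics import NormalDist


def critical_r(n, alpha):
    """Approximate smallest r that is significant at level alpha (one-tailed).

    Uses t = r * sqrt(n - 2) / sqrt(1 - r^2), solved for r, with the
    critical t replaced by the standard normal quantile (large-sample
    approximation, assumed here).
    """
    if n <= 2:
        raise ValueError("n must exceed 2")
    z = NormalDist().inv_cdf(1.0 - alpha)
    return z / sqrt(n - 2 + z * z)


if __name__ == "__main__":
    for alpha in (0.05, 0.01):
        print(alpha, round(critical_r(2360, alpha), 4))
```

A published significance table, as used in the study, would give essentially the same thresholds at this sample size.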

The first major finding, bearing on the research
question, was the correlation between the average student
grade in each course section and the average student rating
of the teacher of that course section. This was found to
be .35 (p < .000001). The other correlations involving
these two variables are particularly interesting. For
example, a correlation of -.15 was found between the average
grade in each course section and the quantitativeness of the
department in which that course is taught. Also, the
correlation between average grade and Item 5 on the rating
instrument ("Fairness in Marking") was .41, and the
correlation between average grade and Item 6 ("Attitude
Toward Student") was .43. Discussion of the implications
of these and other results is withheld until chapter V.

The results of the first stepwise multiple regression
analysis are presented in Table 9. The initial battery of
12 predictors of ratings was reduced to an "optimally
reduced subset" (see chapters I and III). This subset
consisted of the first 10 variables shown in Table 9 (above
the line of dashes). The variables are listed in the rank
order of their importance in improving the predictability
of ratings by the regression equation. Also shown in Table
9, for each variable, are the standard error of estimate
after the variable's inclusion in the regression equation,
the multiple R, $R^2$, the increase in $R^2$ over the previous
step, the simple correlation with the criterion, and the
F-value that signifies the importance of the variable to
the regression equation as of the last step.

The multiple correlation produced by the optimally
reduced subset of 10 predictors was .25. Correction for
shrinkage yielded an R of .24. Although this multiple
correlation is not very large (and accounts for only
slightly more than 6% of the criterion variance), it is
highly significant, F (10, 2349) = 14.12, p < .001. It
should be noted that the first variable to enter the
regression equation, "course level," accounted for over
half of the criterion variance finally accounted for by
the entire optimally reduced subset of 10 predictors.
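
The size of the shrinkage correction can be illustrated with a short sketch. The formula used below is the common Wherry-type adjustment, 1 - (1 - R^2)(N - 1)/(N - m - 1); that this is the formula the study applied is an assumption made here, since the correction itself is described in chapter III rather than in this chapter. The function name `shrunken_r` is likewise introduced only for illustration.

```python
from math import sqrt


def shrunken_r(r2, n, m):
    """Multiple R after a Wherry-type shrinkage correction (assumed formula).

    r2 -- squared multiple correlation before correction
    n  -- sample size
    m  -- number of predictors in the regression equation
    """
    if n - m - 1 <= 0:
        raise ValueError("n must exceed m + 1")
    adjusted_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - m - 1)
    # A negative adjusted value is conventionally reported as zero.
    return sqrt(max(adjusted_r2, 0.0))


if __name__ == "__main__":
    # R squared of .06029 for the 10-predictor subset, N = 2,360.
    print(round(shrunken_r(0.06029, 2360, 10), 2))
```

With N this large relative to the number of predictors, the correction changes R only in the second decimal place, which is consistent with the drop from .25 to .24 reported above.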

Table 9

First Stepwise Multiple Regression Analysis:
Reduction of the Initial Battery of Predictor Variables

                                                   Increase    Standard Error    Simple
Rank  Variable                    R        R^2      in R^2      of Estimate         r      F-Value   Signif.

  1   Course level             .18162    .03299     .03299       1.18363         .18162     52.190   [illeg.]
  2   Title level              .19699    .03881     .00582       1.18032         .12155     31.695   [illeg.]
  3   Age                      .21045    .04429     .00548       1.17719        -.00118     19.545   ***
  4   Sex                      .22995    .05288     .00859       1.17214         .03509     19.702   [illeg.]
  5   Percentage employed      .23486    .05516     .00228       1.17098         .05653      5.181   *
  6   Dept. quant.             .23889    .05707     .00191       1.17004        -.05062      4.775   *
  7   Class size               .24261    .05886     .00179       1.16918        -.05908      3.772   [illeg.]
  8   Appointment length       .24366    .05937     .00051       1.16911         .01186   [illeg.]   n.s.
  9   Years since hiring       .24431    .05969     .00032       1.16916         .03132      2.065   n.s.
 10   No. of years tenured     .24555    .06029     .00061       1.16903         .02543      1.242   n.s.
 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
 11   Course location          .24603    .06053     .00024       1.16913         .06555       .605   n.s.
 12   Rate of return           .24605    .06054     .00001       1.16938         .00768       .014   n.s.

*** p < .001     * p < .05     n.s. p > .05

Furthermore, even though all 12 predictor variables
increased the multiple R somewhat, only the first 10
reduced the standard error of estimate (optimality).

The F-values of the predictors as of the last step 
show the final significance of each predictor. They also 
demonstrate the fact that predictors may gain or lose 
significance when other predictors enter the regression 
equation. (The rank order of the F-values is not the same 
as the order of inclusion of the variables into the 
regression equation.) 

Addition of the Five New Predictor Variables 
to the Optimally Reduced Subset 

The results of the second stepwise multiple regression
analysis are presented in Table 10. These results show that
the average of the student grades in each course section was
the single best predictor of the average rating of the
teacher of that course section. Furthermore, the addition
of the grading variables to the optimally reduced subset of
other predictors significantly improved the multiple
correlation from .25 to .39, F (4, 2345) = 60.13, p < .001.
The variable "average grade" by itself accounted for nearly
8.5% of the criterion variance (more than was accounted for
by the entire optimally reduced subset of other predictors).
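
The figure of nearly 8.5% is the increase in the squared multiple correlation at the step where "average grade" entered. A minimal sketch of that arithmetic follows, using the multiple R values listed in Table 10 for steps 10 and 11 (.24555 and .38080); the function name `r2_increase` is an assumption introduced here.

```python
def r2_increase(r_before, r_after):
    """Increase in explained criterion variance between two regression steps,
    given the multiple R before and after the new predictor enters."""
    return r_after ** 2 - r_before ** 2


if __name__ == "__main__":
    gain = r2_increase(0.24555, 0.38080)
    print(f"{100 * gain:.2f}% of criterion variance")
```

Because the comparison is between squared correlations, a rise in R from about .25 to about .38 more than doubles the variance accounted for.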

A further indication of the importance of the grading 
variables is provided by the fact that the variable "average 
grade," when it entered the regression equation at step 



Table 10

Second Stepwise Multiple Regression Analysis:
Addition of the Five Grading Variables to the Optimally Reduced Subset

                                                   Increase    Standard Error    Simple
Rank  Variable                    R        R^2      in R^2      of Estimate         r      F-Value   Signif.

  1   Course level             .18162    .03299     .03299       1.18363         .18162      1.333   n.s.
  2   Title level              .19699    .03881     .00582       1.18032         .12155     29.842   ***
  3   Age                      .21045    .04429     .00548       1.17719        -.00118     11.949   ***
  4   Sex                      .22995    .05288     .00859       1.17214         .03509     17.323   [illeg.]
  5   Percentage employed      .23486    .05516     .00228       1.17098         .05653      6.003   **
  6   Dept. quant.             .23889    .05707     .00191       1.17004        -.05062       .002   n.s.
  7   Class size               .24261    .05886     .00179       1.16918        -.05908      3.878   *
  8   Appointment length       .24366    .05937     .00051       1.16911         .01186       .177   n.s.
  9   Years since hiring       .24431    .05969     .00032       1.16916         .03132      3.167   *
 10   No. of years tenured     .24555    .06029     .00061       1.16903         .02543      3.298   [illeg.]
 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
 11   Average grade            .38080    .14501     .08472       1.11533         .35333    206.038   [illeg.]
 12   Skewness of grades       .38445    .14780     .00279       1.11375        -.15866      3.191   *
 13   Standard deviation       .38469    .14799     .00018       1.11386        -.20510      3.532   *
      of grades
 14   Variance of grades       .38621    .14916     .00117       1.11334        -.19352      3.168   *
 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
 15   Kurtosis of grades       .38622    .14916     .00001       1.11357        -.09191       .019   n.s.

*** p < .001     ** p < .01     * p < .05     n.s. p > .05

number 11, completely dominated the regression equation.
It took over and rearranged the regression equation to such
an extent that the variables "course level" and "department
quantitativeness" lost nearly all of their significance as
contributors to the final regression equation. The extent
of the rearrangement is clearly indicated by the column of
F-values in Table 10 compared to the same column in Table 9.

The first 14 variables listed in Table 10 (above the
lower line of dashes) produced the multiple correlation
with the lowest standard error of estimate. That R was .39.
Correction for shrinkage yielded an R of .38. While this
is still not a very large multiple correlation (accounting
for only about 15% of the criterion variance), it is highly
significant by itself, F (14, 2345) = 28.28, p < .001. The
significance of the increase in the multiple correlation is
described above.

Parallel Results Using Other Criteria 

The choice of the criterion variable for the above
regression analyses was explained above (see Factor Analyses
of the Rating Instrument's Items). As mentioned above,
three other criteria, Item 8 by itself, the average of the
other seven items, and Item 5 by itself, might represent
the single ratings factor as well as the average of all
eight items. In order to determine if the choice of the
criterion was important, three more pairs of stepwise
multiple regression analyses were performed. These analyses
were run for the three other criteria exactly as they were
for the first criterion. A reduction of the initial battery
of predictors was done first, and then the five grading
variables were added to the new optimally reduced subset in
each case.

The results of these analyses were nearly identical to
those obtained with the original criterion. Using "Item 8"
as the criterion, the reduction of the initial battery of
predictors resulted in a multiple correlation of .23. The
correction for shrinkage yielded an R of .22, F (9, 2350) =
13.01, p < .001. The variable "appointment length" did not
make a significant contribution to this regression equation,
though it did for the other three criteria. Thus, this
optimally reduced subset of predictors of "Item 8" included
only nine of the independent variables.

The addition of the grading variables to this optimally
reduced subset of nine predictors increased the multiple
correlation to .36, F (13, 2346) = 26.09, p < .001. The
correction for shrinkage did not lower this R. The increase
in the multiple correlation was also highly significant,
F (4, 2346) = 52.94, p < .001.

Similarly, when the average of the first seven items
on the UCRSI was used as the criterion, the reduction of the
initial battery of predictors resulted in a multiple
correlation of .25. Correction for shrinkage yielded an R
of .24, F (10, 2349) = 14.44, p < .001. This optimally
reduced subset of predictors included the same 10 variables
as the subset of predictors of the average of all eight
items, and there was only one minor difference in their
order: the variables "age" and "sex" were reversed although
their respective F-values as of the last step were nearly
identical in both analyses.

The addition of the five grading variables in this case
increased the multiple correlation to .39, with the
correction for shrinkage yielding an R of .38, F (14, 2345)
= 28.72, p < .001. The increase in the multiple correlation
was also highly significant, F (4, 2345) = 60.75, p < .001.

The use of "Item 5" as the criterion yielded slightly
different results. The reduction of the initial battery of
predictors produced a lower multiple R than it did for the
other three criteria. However, it was still a highly
significant .19, F (9, 2350) = 9.29, p < .001. The
correction for shrinkage did not lower this R. The
resulting optimally reduced subset consisted of nine of the
predictors. "Course level," "age," and "title level" were
the best three predictors out of the initial battery of 12
(as in the other reductions), but the order of the less
important predictors was altered.

When the five grading variables were added to the
optimally reduced subset of predictors of "Item 5," the
multiple correlation increased to .45, F (13, 2346) = 44.[illegible]7,
p < .001. The correction for shrinkage did not lower this
R. This increase (from the smallest R found in this study
to the largest) was, of course, very significant,
F (4, 2346) = 120.71, p < .001.

It should be noted that for all four pairs of
regression analyses, the reported F-values computed for
the increases in the multiple R's (on account of the
addition of the grading variables) were based on the
multiple R's after shrinkage. This is conservative
(slightly understating the F-value), but in this study it
resulted in no practical differences, since all of the
F-values were so highly significant.
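
The practice described above, applying the incremental F-test to multiple R's after shrinkage, can be sketched as follows for the primary criterion. The squared correlations (.14916 for 14 predictors, .06029 for 10) are the values printed in Tables 10 and 9, and N = 2,360. The shrinkage step uses a Wherry-type adjustment, which is an assumption made here rather than a formula taken from this chapter, and both function names are introduced for illustration.

```python
def shrunken_r2(r2, n, m):
    """Squared multiple R after a Wherry-type shrinkage correction (assumed)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - m - 1)


def incremental_f(r2_larger, r2_smaller, n, m1, m2):
    """F for the increase in squared multiple R, with df (m1 - m2, n - m1 - 1)."""
    return ((r2_larger - r2_smaller) / (m1 - m2)) / ((1.0 - r2_larger) / (n - m1 - 1))


N = 2360
r2_full = shrunken_r2(0.14916, N, 14)     # 10 other predictors + 4 grading variables
r2_reduced = shrunken_r2(0.06029, N, 10)  # optimally reduced subset alone
f_increase = incremental_f(r2_full, r2_reduced, N, 14, 10)

if __name__ == "__main__":
    print(round(f_increase, 2))
```

Under these assumptions the sketch gives an F of about 60.1 on 4 and 2,345 degrees of freedom, in line with the value reported earlier for the primary criterion.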

Summary

Factor analyses of the eight-item rating instrument
showed that there was essentially one factor underlying the
UCRSI ratings data. This led to the choice of the average
of all eight items on the UCRSI as the criterion variable
to represent that factor. Multiple regression analyses
yielded low but highly significant multiple correlations.
Moreover, they showed that the addition of the grading
variables to the optimally reduced subset of other
predictors of ratings did indeed make a highly significant
improvement in the predictive efficiency of the regression
equation. Furthermore, practically identical results were
obtained for three other criterion variables.

CHAPTER V

DISCUSSION AND CONCLUSIONS

This chapter presents a discussion of the results of
this study and explores several implications of those
results. This chapter also provides some recommendations
in light of certain conclusions based on those results and
implications.

Discussion of Results and Implications 

The results of this study apparently provide a unique
contribution to the literature on the research question:
What is the influence of the grades students receive on
their ratings of the college teachers who gave them those
grades? This study has used multivariate techniques, and
has studied a very large sample of course sections across
an entire university. Previous studies have studied
typically only one or a few course sections, or have
suffered from several other inadequacies (see chapter I).
This study did not suffer from those inadequacies, and thus
the results are probably more definitive and generalizable.
Furthermore, these results provide more up-to-date knowledge
in view of certain changes that have occurred in rating and
grading practices over the past decade (see chapter I).

The primary implication of the results of this study
is that there is an interrelationship between grades and
ratings. It would seem that both the direction and the
extent of the hypothesized "grading bias" have been
demonstrated. That is, students apparently tend to rate
lower those teachers from whom they receive lower grades
and vice versa. However, in light of the fact that the
grading variables accounted for only about 9% of the
variance in ratings in this study, other factors (valid or
not) must also be influencing the students' ratings. In
other words, students did not simply "payoff" their teachers
with ratings in direct proportion to the grades they
received from those teachers. Rather, the students were
biased or influenced by their grades, overall, in such a
way that permitted other considerations also to be involved
in the rating process.

Possibly some of the students did strictly "payoff"
their teachers for grades received, while others were not
at all influenced by their grades. On the other hand, it
is possible that most or all of the students' ratings are
merely "shifted" by the grades received. That is, students
might rate teachers more or less validly, but plus or minus
a certain amount according to the grades received. Without
the ability to pair individual student grades and ratings
(given the anonymity of ratings), it could not be determined
whether some students were significantly more influenced by
their grades than were other students. This would be an
appropriate question for further research to answer
conclusively, but the previous studies that have used
paired data have often found essentially the same degree
of relationship between grades and ratings, in spite of
certain inadequacies or faults in those studies.

Overall average ratings are often the measure used by
administrators and department chairmen for making decisions
about faculty pay, promotion, and tenure, and this study has
shown the probable existence of an overall grading bias.
To some extent, it is irrelevant what the differences are
between individual students, when it comes to the grading
bias. It does not even matter whether the bias is conscious
or subconscious, so long as it is actually dependent upon
grades. Unless one could develop a way to count only the
ratings of those students who are not biased by grades, or
unless a conscious grading bias were subject to elimination
more than an unconscious bias, then the overall grading bias
will exist whatever its makeup might be.

Even though the rate of return of the ratings used in
this study was only 55%, it has been about 55% for many
years, and these are the ratings that are systematically
used for making administrative decisions about faculty
careers. That is, even though returned ratings may not be
generalizable across all of the students the faculty member
taught (since students who return ratings may not constitute
a random sample so far as their opinions are concerned),
such ratings are commonly used as though they are
representative of overall student opinion. The results
of this study should be generalizable to the many other
rating systems operating with similar rates of return.
It remains for future researchers to determine whether an
appreciably different rate of return would have any
influence on the relationships found in this study, but
these results are applicable to rating systems as they are
commonly used.

Also, to the extent that the course sections and
rating procedures at the University of Connecticut are
representative of those across the nation, the results of
this study are generalizable. The relationship between
grades and ratings probably varies from one institution
to another because of differences in the students, faculty,
grading systems, rating instruments, and rating procedures.
For example, the timing of ratings (before or after final
grades are awarded) is an important difference between
rating systems, even though previous research (e.g., Bausell
& Magoon, 1972a; Holmes, 1971) indicates that an expected
grade bias probably exists before the actual grade is
determined. Perhaps future researchers could use similar
rating procedures at a large number of institutions with
comparable grading practices to find out how generalizable
the results of this study are.

It is possible that there are other variables which,
when added to the final regression equation in this study,
would lower or eliminate the importance of the grading
variables as predictors of ratings. That is, it is possible
that grades and ratings correlate without there being any
causal relationship between them, and that ratings actually
depend completely on some other unknown factor or factors.
This is doubtful, however. The simple and multiple
correlations found in this study indicate a definite, if
partial, relationship. Furthermore, the very high
significance levels of these simple and multiple
correlations suggest that the results are not attributable
to chance variation. Also, certain previous findings (see
especially Holmes, 1972) provide strong (experimental)
evidence of a causal relationship between grades and
ratings.

The results of this study would have been even more
definitive, had the optimally reduced subset of the initial
battery of predictors accounted for a larger amount of the
variance in ratings. This would have lowered the chances
that any other variable exists which would account for most
of the variance in the ratings (thus suggesting that the
present results might be spurious). The more criterion
variance "partialled out" by the optimally reduced subset,
the more significant the increase attributable to the
grading variables would have been.

Ideally, nearly all of the variance in ratings would
be attributable to students' reliable perceptions of their
teachers' performances (Treffinger and Feldhusen, 1970).
If the grading bias were about the only exception to this
rule, it might explain why only about 6% of the criterion
variance could be accounted for by the 10 predictors in the
optimally reduced subset. That is, maybe there are no
predictors of ratings more significant than those found in
this study (outside of other variables which might tap the
students' opinions about their teachers in some other way).

Even with the grading variables added in, the final
regression equation in this study was able to account for
only about 15% of the criterion variance. With such a
large portion of the variance left unaccounted for, the
chances are theoretically greater that some variable could
disprove the existence of what seems to be the grading bias.
This writer doubts the existence of any such variable.
Most likely there are few accurate predictors of student
ratings of faculty performance.
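
As a rough illustration of how an increment in explained variance of this kind can be tested, the sketch below computes the usual F ratio for an increase in the squared multiple correlation when predictors are added to a regression equation. The function name is hypothetical. The proportions are the rounded figures quoted above (about 6% for the reduced subset, about 15% with the five grading variables added), and the number of cases is assumed to equal the 2,360 course sections rated, so the F-values actually obtained in this study may differ.

```python
def incremental_f(r2_reduced, r2_full, n_added, n_total_predictors, n_cases):
    """F ratio for the gain in R-squared when n_added predictors enter.

    Hypothetical helper for illustration only; not the SPSS routine
    used in the study.
    """
    df_num = n_added
    df_den = n_cases - n_total_predictors - 1
    f_ratio = ((r2_full - r2_reduced) / df_num) / ((1.0 - r2_full) / df_den)
    return f_ratio, df_num, df_den


# Rounded figures from the text: 10 predictors explain about 6%,
# and 15 predictors (10 + 5 grading variables) explain about 15%.
# n_cases = 2360 is an assumption (one case per course section).
f_ratio, df1, df2 = incremental_f(0.06, 0.15, 5, 15, 2360)
print(f"F({df1}, {df2}) = {f_ratio:.1f}")
```

With degrees of freedom this large, even a modest increment in explained variance yields a very large F ratio, which is why an effect can be highly significant while leaving most of the variance unaccounted for.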

One possibility is that student ratings are almost
totally invalid as measures of effective teaching anyway,
and relate more to student and faculty personality
interactions. Similarly, peer ratings may be measures of
personality conflicts or even secondhand student ratings.
Furthermore, without a consensual definition of effective
teaching (or the purposes of education for that matter),
even unbiased judgments may not be highly related to each
other.

The results of studies by Treffinger and Feldhusen
(1970) and by Bausell and Magoon (1972b) indicate that
students' first impressions and preconceptions are
persistent and may be the best predictors of end-of-course
ratings. Thus, it is possible that researchers should be
looking for predictors of such first impressions or
preconceptions. Perhaps the grades students receive
constitute the major influence on ratings after the initial
opinion is formed.

Also, to the extent that there is a large portion of
error variance in the ratings data, the importance of the
grading bias found in this study looms even larger. That
is, if the reliability of the ratings is considerably lower
than 1.00, then the grading variables would account for more
than 9% of the systematic variance. For example, if the
reliability were .75, then the percentage of the systematic
variance accounted for by the grading variables would be
12% (.09 / .75). Unfortunately, there is no accurate
estimate of exactly how reliable student ratings are, nor
even how reliability would best be defined. But for the
purposes of this study, it is conservative to assume that
the reliability is nearly 1.00 and that the grading bias
is no more significant than indicated by the findings.
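
The arithmetic in the preceding paragraph can be sketched as follows. The function name is hypothetical, and every reliability value other than the .75 used in the text is an arbitrary illustration, not an estimate of the reliability of these ratings.

```python
def share_of_systematic_variance(explained, reliability):
    """Explained variance as a proportion of the reliable (systematic) variance.

    Hypothetical helper: divides the proportion of total variance explained
    by the reliability of the criterion, as in the text's .09 / .75 example.
    """
    if not 0.0 < reliability <= 1.0:
        raise ValueError("reliability must be in the interval (0, 1]")
    return explained / reliability


# .09 is the proportion attributed to the grading variables in the text.
# Only the .75 reliability comes from the text; the others are illustrative.
for reliability in (1.00, 0.90, 0.75, 0.60):
    share = share_of_systematic_variance(0.09, reliability)
    print(f"reliability {reliability:.2f}: {share:.1%} of systematic variance")
```

The lower the assumed reliability, the larger the share of the systematic variance attributed to the grading variables, so assuming a reliability near 1.00 is the conservative choice.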

Several of the details of the results of this study
are worthy of mention. Notably, the results of the factor
analyses of the eight-item rating instrument (see Tables 5
and 6) are interesting in their own right, beyond their
utility for the selection of criteria. Whatever the general
impression of teachers may depend upon, it was apparently
measured in the same way by the rating instrument in 1972
and 1973. The single factors and the correlations between
items for the two years are strikingly similar. It is also
interesting to note the order of the sizes of the factor
loadings. "Knowledge of Subject" had the lowest factor
loading of all eight, and "Over-All Summary as a Teacher"
had the highest loading.

The finding of only one factor suggests that a halo
effect is present, and, as suggested by Hoyt (1969) and by
Widlak et al. (1973), the separate items may be of little
diagnostic value. This writer doubts that the high
correlations found between items represent any "true"
relationships between the eight traits that the instrument
attempts to measure. Rather, there seems to be a halo
effect confounding the meaning of the separate items. It
is possible that certain "profile effects" (indicative of
differences across the rating items) may exist, even though
masked by the halo effect. However, the diagnostic value
of such "high inference" items (very general and open to
varying interpretations) is questionable anyway, even if
there were no confounding influences. It is not at all
clear exactly what a teacher should do to improve his
ratings on such global traits.

Student ratings, therefore, may not be improving
teaching in the expected way. Specific, but global, items
undoubtedly can not help the teacher improve his teaching.
In fact, with a grading bias present, ratings almost
assuredly act counter to the larger educational goals.
That is, while a full range of grades may be educationally
appropriate, the grading bias would act to diminish the
apparent effectiveness of teachers who do give out some low
grades, and it would act to reward the lenient teachers who
give out higher and less discriminating grades.

Of course, one optimistic explanation for a grading
bias is that many high grades accompanied by high ratings
in a course section might reflect superior teaching or a
highly successful class (perhaps using "contract grading").
However, this writer doubts that such is the general case.
External criteria, such as standardized achievement tests,
would be needed to substantiate any claim that such high
grades are deserved, especially in light of the grading
trend over the past several years (see chapter I).

It was hypothesized that certain departments at this
university were more lenient (in the awarding of grades)
than others. Specifically, it was thought that the "hard"
science and mathematical departments were giving out lower
grades than other departments, and that the faculty members
in those highly quantitative departments might be suffering
from lower ratings. The correlation of -.15 found between
average grade and department quantitativeness supports such
a theory, as does the correlation of -.05 between ratings
and department quantitativeness. Of course these
correlations do not prove any causality of the relationships.
In addition, the size of the correlations suggests that the
relationships are slight.

Nevertheless, department quantitativeness was a
significant predictor of ratings until the grading variables
were allowed to enter the regression equation; then its
predictive potential was subsumed by the grading variables.
It would seem that the variable "department quantitativeness"
was providing some information about grades indirectly
(because of the relationship between department
quantitativeness and grades). However, when "better"
information about grades (the grading variables themselves)
entered the regression equation, then department
quantitativeness was of no further importance in predicting
ratings. The same situation occurred with the variable
"course level," which was the best predictor of ratings
before the grading variables were considered. The F-values
in Tables 9 and 10 demonstrate the extent of these variables'
loss in predictive potential.

It is possible that certain departments are suffering
more from the grading bias than others on account of the
differences in grading practices. Moreover, it is likely
that some individual teachers are suffering more than others
according to their grade distributions. The teachers with
the most lenient grade distributions are probably not always
the best teachers. Thus, the grading bias lowers the
probability that ratings could be valid as measures of
teaching effectiveness.

As suggested earlier, the effect of grades on ratings
well may be only partial, i.e., one of several influences
(see discussion of first impressions above). The simple
correlation between grades and ratings found in this study
was not high (.35), but it was very highly significant
because of the large sample size. Many previous studies,
in spite of serious shortcomings (see chapter I), also found
significant correlations of this approximate magnitude.
Typically the correlations did not exceed .30 (Costin et
al., 1971). The results of this study support the findings
of the large portion of the previous research which found
such correlations. Thus, this correlation of .35 between
grades and ratings, and the similar correlations found by
other researchers, could be quite accurate and indicate that
the "true" relationship between grades and ratings probably
lies in the vicinity of .30 to .35.
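
The point that a correlation of .35 can be modest in size yet very highly significant in a large sample can be illustrated with the standard t statistic for a simple correlation. This is a sketch only: the function name is hypothetical, the large sample size is assumed to be the 2,360 course sections rated, and the small sample of 30 is an invented contrast, not a figure from any study cited here.

```python
import math


def t_for_correlation(r, n):
    """t statistic for testing a simple correlation r against zero (df = n - 2)."""
    return r * math.sqrt((n - 2) / (1.0 - r * r))


r = 0.35
shared_variance = r * r  # proportion of variance the two variables share

# n = 2360 is an assumption (one case per course section);
# n = 30 is a hypothetical small study shown only for contrast.
t_large = t_for_correlation(r, 2360)
t_small = t_for_correlation(r, 30)

print(f"r = {r:.2f}, r squared = {shared_variance:.4f}")
print(f"n = 2360: t = {t_large:.2f}")
print(f"n = 30:   t = {t_small:.2f}")
```

The same correlation gives a t statistic roughly nine times larger in the large sample than in the small one, although the proportion of shared variance is identical in both cases.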

The correlation of .41 found between average grade and
Item 5 ("Fairness in Marking") indicates that this item is
especially sensitive to the grading bias. Students
evidently consider higher grades as "fair." This is not
very surprising. One might expect that a rating item like
"Fairness in Marking" would receive the brunt of the grading
bias. Actually, however, the grading bias apparently
affects all eight items (see correlations among average
grade and Items 1-8 in Table 8). Furthermore, the
correlation is highest (.43) for Item 6 ("Attitude Toward
Student"). Students apparently consider grades awarded as
a primary indication of their teachers' attitudes toward
students.

While Items 5 and 6 seem to be the most sensitive of
the eight items to the grading bias (having the highest
correlations with all five of the grading variables), other
considerations must also enter into the students' ratings
even of these two items. When Item 5 by itself was the
criterion variable (see Parallel Results Using Other
Criteria), the grading variables still managed to account
for only about 17% of the criterion variance. Thus, the
bias is partial even for Items 5 and 6, but it is pervasive
across all eight items. This finding supports the
above-mentioned assertion by Bausell and Magoon (1972b) that
disappointed students "will tend to deprecate the
instructor's teaching performance in areas other than his
grading system" (p. 130, emphasis added).



It would seem that one important influence on student
ratings of college teachers has been identified. It is the
so-called "grading bias," and it apparently accounts for
about 9% of the variance in ratings. Variables which could
account for most of the rest of the variance, however, have
not been identified. Student ratings may or may not be
mostly valid in spite of the grading bias. There is no
proof either way. It remains for future research to answer
many such questions raised by the results of this study.
Even if researchers could find variables which would account
for the rest of the variance in the ratings, that would be
no guarantee that ratings are valid as measures of effective
teaching. Such variables, however, probably would offer
substantial clues as to whether or not ratings are valid,
depending on the identity of those variables.

Well defined educational goals are needed before
effective teaching can be defined, and external criteria of
effective teaching must be well defined before valid
measures of effectiveness can be firmly established. This
positive, constructive, and difficult work needs to be done.

The results of this study are, therefore, somewhat
negative. That is, a grading bias has been found which most
likely serves to lower the validity of student ratings.
Perhaps positive steps could be taken which would eliminate
the grading bias. If possible, methods should be devised
which would do just that. One possibility is that ratings
could be collected very early in the semester. This would
allow for feedback to the teachers in time for improvement
to occur before the end of the semester (assuming the
teachers would be responsive and that ratings would indicate
desired changes). However, there is still likely to be an
expected grade bias (Holmes, 1971), and there is still no
certainty about the validity of ratings even without any
grading bias.

The results of this study have demonstrated a grading
bias for a "high inference" rating instrument (containing
very general questions open to varying interpretations), but
radically different results might occur with so-called "low
inference scales" (asking more objective questions which
supposedly would be less subject to halo effects and biases).
Such scales may or may not eliminate the grading bias and
may or may not be valid as measures of effective teaching.
These facts remain to be determined by future researchers.

One might suggest that students could be "taught" how
to rate their teachers more fairly. The evidence of a halo
effect, however, suggests that students are unable to
separate their feelings about grades and teachers,
especially on "high inference scales." Or it might be
suggested that, if grading practices were completely fair
and non-arbitrary, then any bias caused by grades would
disappear. However, if grades were made fair (i.e., more
justly discriminating), it seems that grades would be lower,
and that this might in turn increase the bias.

Perhaps the most important conclusion for immediate
consumption is that, since grades do seem to influence
student ratings of college teachers, this bias should be
taken into account whenever one is interpreting student
ratings. Administrators, department chairmen, and promotion
and tenure committees across the nation should remember that
the grading bias exists and that certain teachers may suffer
from the bias more than others, depending on the grades they
give out. It seems intuitively obvious that one should not
reward teachers for leniency if students are expected to
work and to achieve. It would be better to reward teachers
according to the actual achievement levels reached by their
students (measured by standardized achievement tests).

One policy option might be to drop student ratings
altogether. They are costly, and possibly not worth the
cost, especially since they must be interpreted with caution.
Some reasons for not eliminating student ratings are: they
are well established; some institutional prestige is
associated with their use; many students want them; they at
least give the appearance of requiring teacher accountability;
and dropping them could conceivably lower faculty concern
for effective teaching. On the other hand, it might be
argued, better teaching and fairer grading would occur
without the faculty's probable fear of reprisals on ratings
(Ladas, 1974; "Too Many A's," 1974).

This writer recommends that university officials and
faculties review the above implications, and decide how best
to serve the educational needs involved. If student ratings
can not be made objective and valid, if the grading bias can
not be eliminated, and if student ratings can not be dropped
outright, then it would seem two possible paths are suggested.
Either decision makers should ignore the ratings, or they
should combine them with other measures of teacher
effectiveness (perhaps achievement tests or indices of the
amount of work done by the students) in such a way that
ratings would not encourage teachers to be slack or to
demand too little.
APPENDIX

MACRO FLOWCHARTS

OF THE DATA PROCESSING STEPS USED IN THIS STUDY

Figures 6 through 18 represent the major data
processing steps required to accumulate the data for this
study. Figure 5 provides a key to the symbols used in
those figures, and the letters and numerals within the
symbols in the figures are the names of the tapes,
documents, and operations symbolized. Keypunching and
manual data lookup operations are labeled but not named.
A brief description of the purpose of each step follows.

Merge 1. This first step matched and merged data from
the "instructor header record" tape (one instructor header
record per course section evaluated) with data from the
"professional history file" tape. The merged data were
output onto both paper and tape, and information about
missing data was also printed on the paper output for use
in the next step.

First missing data input. In this step, the
information obtained in the merge 1 step was used to look
up and punch onto cards the data missing from the
professional history file tape. The resulting deck of
cards was retained for input into the merge 5 step below.

Merge 2. This step matched and merged the faculty
evaluations data with the merged output from the merge 1
step. The individual ratings data were used to compute
means, N's, and standard deviations for each item on the
rating scale, and the results were output onto both paper
and tape.

Sort 1. The records on the output tape from the merge
2 step above were sorted by branch, department, course
number, and section number. The sorted records were output
onto another tape for later use in the merge 3 step.

Sort 2. The grade distribution data were sorted as in
sort 1 above so that the course sections would be in the
same sequence on both tapes for input into the merge 3
program.

Merge 3. This program matched and merged the grade
distribution data with the other data previously assembled
for each course section. Computations of average grades
and standard deviations of grades in the course sections
were made at this point, and the merged results were output
onto paper and tape. Also on the paper output was
information about missing data detected by this program.
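
The sort and match-merge logic described for the sort 1, sort 2, and merge 3 steps might be sketched in a modern language as follows. This is not the original program. The function names, the record layout, the 4-point grade scale, the use of a population standard deviation, and the sample records are all assumptions made for illustration.

```python
import math

# Assumed grade-point scale; the scale actually used is not stated in this appendix.
GRADE_POINTS = {"A": 4.0, "B": 3.0, "C": 2.0, "D": 1.0, "F": 0.0}


def grade_stats(counts):
    """Average grade and (population) standard deviation from grade counts."""
    n = sum(counts.values())
    mean = sum(GRADE_POINTS[g] * c for g, c in counts.items()) / n
    variance = sum(c * (GRADE_POINTS[g] - mean) ** 2 for g, c in counts.items()) / n
    return mean, math.sqrt(variance)


def match_merge(section_records, grade_records):
    """Sort both inputs by section key, then merge them in a single pass.

    Each record is (key, data), where key is the tuple
    (branch, department, course number, section number).
    Returns the merged records and the keys lacking grade data.
    """
    left = sorted(section_records, key=lambda rec: rec[0])
    right = sorted(grade_records, key=lambda rec: rec[0])
    merged, missing = [], []
    i = j = 0
    while i < len(left) and j < len(right):
        left_key, left_data = left[i]
        right_key, right_data = right[j]
        if left_key == right_key:
            mean, sd = grade_stats(right_data)
            merged.append((left_key, {**left_data, "avg_grade": mean, "sd_grade": sd}))
            i += 1
            j += 1
        elif left_key < right_key:
            missing.append(left_key)  # rated section with no grade distribution
            i += 1
        else:
            j += 1  # grade distribution for a section that was not rated
    missing.extend(key for key, _ in left[i:])
    return merged, missing


# Hypothetical sample records, not data from the study.
sections = [
    (("MAIN", "MATH", 110, 2), {"mean_rating": 3.9}),
    (("MAIN", "ENGL", 105, 1), {"mean_rating": 4.2}),
]
grades = [(("MAIN", "ENGL", 105, 1), {"A": 5, "B": 10, "C": 5})]

merged, missing = match_merge(sections, grades)
print(merged)
print("missing grade data:", missing)
```

Sorting both inputs on the same key is what allows the merge to proceed in one sequential pass, which matters when the records are read from tape; the list of unmatched keys corresponds to the missing data report printed for the following step.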

Second missing data input. The information on missing
data from merge 3 was used, in this step, to create a deck
of cards containing that missing data. The cards were
retained for input into the merge 5 step below.

Merge 4. This program matched students' records of
their Quantitative and Verbal Scholastic Aptitude Test scores
with their declared majors in order to compute for each
department the variable "department quantitativeness." The
results of these computations were output onto paper and
cards. The deck of cards was retained for input into the
merge 5 program.

Merge 5. The purpose of this step was to insert the
data on cards (from previous steps) into the course section
records, to compute several new variables from old ones (age
from date of birth for example), and to edit the data for
the acceptability of the values. The results were output
onto paper and tape, and information about missing data and
improper values was also printed on the output paper.

Third missing data input. The information on missing
or unacceptable data from the merge 5 program was used to
create a deck of cards containing the missing data. This
deck was retained for input into the merge 6 program below.

Merge 6. This program was used to insert the missing
data discovered in the merge 5 step into the final data
records, which were output onto paper and tape.

Regression run 1. This step accomplished the reduction
of the initial battery of predictor variables using the
stepwise multiple regression analysis subprogram of the
Statistical Package for the Social Sciences (SPSS, Nie et
al., 1970). The results of this analysis were used to make
the ordered list of variables (the reduced set in the order
of their inclusion in the regression equation) for input
into the next step.

Regression run 2. This step was another stepwise
multiple regression analysis using SPSS as above, but with
the five grading variables added to the optimally reduced
subset of predictors found in regression run 1.

Magnetic Computer Tape
Paper Document or Computer Output Paper
Processing Step or Program
Deck of Computer Cards
Keypunching Step
Manual Operation
Sorting Operation

Figure 5. Key to symbols used in Figures 6 through 18.

Figure 6. Merge 1.

TAB738
LOOK UP MISSING DATA
KEYPUNCH MISSING DATA
PROFMD

Figure 7. First missing data input.

Figure 8. Merge 2.

Figure 9. Sort 1.

MERGE 3

Figure 11. Merge 3.

LOOK UP MISSING DATA

Figure 12. Second missing data input.

VARIABLE LIST 1
REGRESSION RUN 1
TABRR1
VARIABLE LIST 2
KEYPUNCH VARIABLE LIST 3
VARIABLE LIST 3

Figure 17. Regression run 1.

VARIABLE LIST 3
REGRESSION RUN 2
TABRR2

Figure 18. Regression run 2.